Scikit-learn: creating a labeled dataset from segmented time series

INTRO

I have a Pandas DataFrame that holds segmented time series of different users (i.e. user1 and user2). I want to train a scikit-learn classifier with these DataFrames, but I can't figure out the shape of the scikit-learn dataset I have to create. Since my series are segmented, my DataFrame has a "segID" column that holds the ids of the specific segments. I'll skip the description of the segmentation, as it is provided by an algorithm.

Let's take an example where both user1 and user2 have 2 segments: print df

        username  voltage        segID  
0       user1     -0.154732      0  
1       user1     -0.063169      0  
2       user1      0.554732      1  
3       user1     -0.641311      1  
4       user1     -0.653732      1  
5       user2      0.446469      0  
6       user2     -0.655732      0  
7       user2      0.646769      0  
8       user2     -0.646369      1  
9       user2      0.257732      1  
10      user2     -0.346369      1
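For reference, the example frame can be reconstructed like this (a minimal sketch, with the values copied from the printout above):

import pandas as pd

df = pd.DataFrame({
    'username': ['user1'] * 5 + ['user2'] * 6,
    'voltage': [-0.154732, -0.063169, 0.554732, -0.641311, -0.653732,
                0.446469, -0.655732, 0.646769, -0.646369, 0.257732, -0.346369],
    'segID': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})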

      

QUESTIONS:

The scikit-learn dataset API says to create a dict containing the data and the target, but how should I shape my data, since they are segments and not just a list?

I can't figure out how my segments fit into the n_samples * n_features structure. I have two ideas:

1) Each data sample is a list representing one segment; on the other hand, the target is different for each data record, since the records are grouped by segment. What about target_names? Could this work?

{
    'data': array([
        [-0.154732, -0.063169],
        [ 0.554732, -0.641311, -0.653732],
        [ 0.446469, -0.655732,  0.646769],
        [-0.646369,  0.257732, -0.346369]
        ]),
    'target':
        array([0, 1, 2, 3]),
    'target_names': array(['user1seg1', 'user1seg2', 'user2seg1', 'user2seg2'], dtype='|S10')
}
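For what it's worth, here is a sketch of how these per-segment lists could be built from df with groupby (note that ragged rows like this will not feed directly into most scikit-learn estimators):

# One variable-length list of voltages per (username, segID) pair.
segments = df.groupby(['username', 'segID'])['voltage'].apply(list)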

      

2) data is (just) the ndarray returned by df.values, and target contains the segment IDs, different for each user ... does that make sense?

{
    'data': array([
        [-0.154732],
        [-0.063169],
        [ 0.554732],
        [-0.641311],
        [-0.653732],
        [ 0.446469],
        [-0.655732],
        [ 0.646769],
        [-0.646369],
        [ 0.257732],
        [-0.346369]
        ]), 
    'target': 
        array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
    'target_names': array(['user1seg1', 'user1seg1', 'user1seg2', 'user1seg2', .....], dtype='|S10')
}
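A sketch of how the arrays of idea 2 could be derived from df (the 'user1seg1'-style names are built here just for illustration):

import pandas as pd

data = df[['voltage']].values                 # shape (11, 1): one feature per row
names = df['username'] + 'seg' + (df['segID'] + 1).astype(str)
target = pd.factorize(names)[0]               # array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
target_names = names.values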

      

I think the main problem is that I cannot figure out what to use as labels ...

EDIT:

OK, now it's clear ... the labels are given by my ground truth; they are just the usernames. elyase's answer is exactly what I was looking for. To formulate the problem better, I'll explain the meaning of segID here. In time series recognition, segmentation can be useful to highlight meaningful segments. At testing time I want to recognize segments, not the whole series, because the series is rather long and the segments are what is meaningful in my context.

Take a look, for example, at this implementation based on "An Online Algorithm for Segmenting Time Series". My segID is just a column representing the id of a chunk.

[figure: segmented time series]



1 answer


This is not trivial, and there can be several ways to formulate the problem for consumption by an ML algorithm. You should try them all and see which one gives the best results.

As you have already figured out, you need two things: a matrix X of shape n_samples * n_features and a column vector y of length n_samples. Let's start with the target y.

Target:

Since you want to predict a user from a discrete pool of usernames, you have a classification problem: your target will be a vector with np.unique(y) == ['user1', 'user2', ...]
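For example, with one sample per (user, segment) pair, y could be built like this (a sketch; df is the example frame from the question):

# One label per (username, segID) group; the username is the class.
y = df.groupby(['username', 'segID'])['username'].first().values
# -> array(['user1', 'user1', 'user2', 'user2'], dtype=object)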

Features

Your features are the information that you provide to the ML algorithm for each label / user / target. Unfortunately, most algorithms require this information to have a fixed length, and variable-length time series do not fit that description. So if you want to stick with the classic algorithms, you need to somehow condense the time series information for each user into a fixed-length vector. Some possibilities: mean, min, max, sum, first, last values, histogram, spectral power, etc. You will need to come up with the ones that make sense for your problem.
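For example, a fixed-length feature vector per segment can be built with a groupby aggregation; this is only a sketch, under the assumption that each (username, segID) pair is one sample and with an illustrative choice of features:

from sklearn.ensemble import RandomForestClassifier

# Condense each variable-length segment into a fixed-length feature vector.
agg = df.groupby(['username', 'segID'])['voltage'].agg(
    ['min', 'max', 'mean', 'sum', 'first', 'last'])

X = agg.values                                # n_samples x n_features
y = agg.index.get_level_values('username')    # one username label per sample

clf = RandomForestClassifier().fit(X, y)      # any classifier would do here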

So, if you ignore the segID information, your matrix X will look like this:



y/features 
           min max ... sum 
user1      0.1 1.2 ... 1.1    # <-first time series for user 1
user1      0.0 1.3 ... 1.1    # <-second time series for user 1
user2      0.3 0.4 ... 13.0   # <-first time series for user 2

      

Since the segID sequence is itself a time series, you also need to encode it as fixed-length information, for example a histogram / count of all possible values, the most frequent value, etc., as sketched below.

In this case, you will have:

y/features 
           min max ... sum segID_most_freq segID_min
user1      0.1 1.2 ... 1.1 1               1
user1      0.3 0.4 ... 13  2               1
user2      0.3 0.4 ... 13  5               3
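
A sketch of how such segID-derived columns could be computed (the column names are the hypothetical ones from the table above; the grouping here is per user, since the example frame holds one series per user, but with several series per user you would group per series):

# Encode each series' segID sequence as fixed-length columns.
seg_feats = df.groupby('username')['segID'].agg(
    segID_most_freq=lambda s: s.mode().iloc[0],
    segID_min='min',
)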

      

The algorithm will look at this data and will "think": so for user1 the minimum segID is always 1, so if at prediction time I see a time series whose minimum segID is 1, then it must be user1. If it is around 3, it is probably user2, and so on.

Keep in mind that this is just one possible approach. Sometimes it helps to ask: what information will I have at prediction time that will allow me to find which user I am seeing, and why would that information lead to that user?
