Python: splitting trajectories into steps

I have paths created from movements between clusters such as:

user_id,trajectory
11011,[[86], [110], [110]]
2139671,[[89], [125]]
3945641,[[36], [73], [110], [110]]
10024312,[[123], [27], [97], [97], [97], [110]]
14270422,[[0], [110], [174]]
14283758,[[110], [184]]
14317445,[[50], [88]]
14331818,[[0], [22], [36], [131], [131]]
14334591,[[107], [19]]
14373703,[[35], [97], [97], [97], [17], [58]]

I would like to split the multi-move paths into separate segments, but I'm not sure how.

Example:

14373703,[[35], [97], [97], [97], [17], [58]]

into:

14373703,[[35,97], [97,97], [97,17], [17,58]]

The goal is to then use these pairs as edges in NetworkX, analyze the trajectories as a graph, and identify dense motion (edges) between individual clusters (nodes).
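A minimal sketch of that intended NetworkX step, using an illustrative edge list in the target format (the data here is made up for demonstration):

```python
import networkx as nx

# Hypothetical edge list in the target format [from_cluster, to_cluster]
edges = [[35, 97], [97, 97], [97, 17], [17, 58], [35, 97]]

# A MultiDiGraph keeps parallel edges, so repeated moves remain countable
G = nx.MultiDiGraph()
G.add_edges_from((u, v) for u, v in edges)

# Dense motion between two clusters shows up as high edge multiplicity
print(G.number_of_edges(35, 97))  # 2
```

A plain DiGraph would collapse repeated moves into one edge; the multigraph preserves them so edge counts can measure density.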

This is the code I used to create the trajectories originally:

import pandas as pd
import numpy as np

# Import data (raw strings so the backslashes in the Windows path are not treated as escapes)
data = pd.read_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_outputs.csv', delimiter=',', engine='python')
#print(len(data), "rows")

# Create data frame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude','cluster_labels'])

# Filter data frame by count of user_id
filtered = df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
#filtered.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_final_filtered.csv', index=False, header=True)

# Get a list of unique user_id values
uniqueIds = np.unique(filtered['user_id'].values)

# Get the ordered (by timestamp) cluster labels for each user_id
output = [[id,filtered.loc[filtered['user_id']==id].sort_values(by='timestamp')[['cluster_labels']].values.tolist()] for id in uniqueIds]

# Save outputs as csv
outputs = pd.DataFrame(output)
#print(outputs)
headers = ['user_id','trajectory']
outputs.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_moves.csv', index=False, header=headers)


If splitting in this way is possible, could it be done during processing rather than after the fact? I would like to build the pairs on creation and avoid any post-processing.

3 answers


I think you can use groupby with apply and a custom function using zip to turn each list of moves into the desired list of pairs.

Note: count returns the number of non-NaN values; if you are filtering by length and there are no NaN values, len is better.



# Filtering and sorting
filtered = df.groupby('user_id').filter(lambda x: len(x['user_id'])>1)
filtered = filtered.sort_values(by='timestamp')

# Pair each cluster label with its successor
f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
print(df2)
    user_id                                     cluster_labels
0     11011                            [[86, 110], [110, 110]]
1   2139671                                        [[89, 125]]
2   3945641                  [[36, 73], [73, 110], [110, 110]]
3  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
4  14270422                             [[0, 110], [110, 174]]
5  14283758                                       [[110, 184]]
6  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...


A similar solution that defers the filtering to the last step, using boolean indexing:

filtered = filtered.sort_values(by='timestamp')

f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
df2 = df2[df2['cluster_labels'].str.len() > 0]
print(df2)
    user_id                                     cluster_labels
1     11011                            [[86, 110], [110, 110]]
2   2139671                                        [[89, 125]]
3   3945641                  [[36, 73], [73, 110], [110, 110]]
4  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
5  14270422                             [[0, 110], [110, 174]]
6  14283758                                       [[110, 184]]
7  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...


My solution uses pandas' .apply() function. I believe this should work (I tested it on your sample data). Note that I also appended two extra rows at the end to cover the cases where there is only one move and where there is no movement at all.

# Python3.5
import pandas as pd 


# Sample data from post
ids = [11011,2139671,3945641,10024312,14270422,14283758,14317445,14331818,14334591,14373703,10000,100001]
traj = [[[86], [110], [110]],[[89], [125]],[[36], [73], [110], [110]],[[123], [27], [97], [97], [97], [110]],[[0], [110], [174]],[[110], [184]],[[50], [88]],[[0], [22], [36], [131], [131]],[[107], [19]],[[35], [97], [97], [97], [17], [58]],[10],[]]

# Sample frame
df = pd.DataFrame({'user_ids':ids, 'trajectory':traj})

def f(x):
    # Builds edges from a list of moves; lists of length <= 1 pass through unchanged
    if len(x) <= 1:
        return x
    return [x[i] + x[i+1] for i in range(len(x)-1)]

df['edges'] = df['trajectory'].apply(f)


Output:



print(df['edges'])

                                                edges  
0                             [[86, 110], [110, 110]]  
1                                         [[89, 125]]  
2                   [[36, 73], [73, 110], [110, 110]]  
3   [[123, 27], [27, 97], [97, 97], [97, 97], [97,...  
4                              [[0, 110], [110, 174]]  
5                                        [[110, 184]]  
6                                          [[50, 88]]  
7          [[0, 22], [22, 36], [36, 131], [131, 131]]  
8                                         [[107, 19]]  
9   [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...  
10                                               [10]  
11                                                 []


To fit this into your pipeline, apply it right after you obtain the trajectory column (whether right when you load the data or after the necessary filtering).
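For instance, the pairing can happen inside the loop that builds the output, before anything is written to CSV. This is a sketch under the question's column names, with a small illustrative frame standing in for the real filtered data:

```python
import pandas as pd

# Minimal stand-in for the question's filtered frame (illustrative data)
filtered = pd.DataFrame({
    'user_id': [14373703] * 6,
    'timestamp': range(6),
    'cluster_labels': [35, 97, 97, 97, 17, 58],
})
uniqueIds = filtered['user_id'].unique().tolist()

# Same per-user extraction as the question, but pairing the moves on the fly
output = []
for uid in uniqueIds:
    traj = (filtered.loc[filtered['user_id'] == uid]
            .sort_values(by='timestamp')[['cluster_labels']]
            .values.tolist())
    edges = [a + b for a, b in zip(traj[:-1], traj[1:])]
    output.append([uid, edges])

print(output)
# [[14373703, [[35, 97], [97, 97], [97, 97], [97, 17], [17, 58]]]]
```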



If you zip your path with itself, offset by one, you get the desired result.

Code:

# Pair each move with its successor and unwrap the single-element lists
for id, traj in data.items():
    print(id, list([i[0], j[0]] for i, j in zip(traj[:-1], traj[1:])))


Test data:

data = {
    11011: [[86], [110], [110]],
    2139671: [[89], [125]],
    3945641: [[36], [73], [110], [110]],
    10024312: [[123], [27], [97], [97], [97], [110]],
    14270422: [[0], [110], [174]],
    14283758: [[110], [184]],
    14373703: [[35], [97], [97], [97], [17], [58]],
}


Results:

11011 [[86, 110], [110, 110]]
14373703 [[35, 97], [97, 97], [97, 97], [97, 17], [17, 58]]
3945641 [[36, 73], [73, 110], [110, 110]]
14283758 [[110, 184]]
14270422 [[0, 110], [110, 174]]
2139671 [[89, 125]]
10024312 [[123, 27], [27, 97], [97, 97], [97, 97], [97, 110]]

