How do I split cell values across multiple lines in a pandas dataframe?
I have the following dataframe, which was produced by this code:
df1=df.groupby('id')['x,y'].apply(lambda x: rdp(x.tolist(), 5.0)).reset_index()
Refer here
The resulting data frame:
id x,y
0 1 [(0, 0), (1, 2)]
1 2 [(1, 3), (1, 2)]
2 3 [(2, 5), (4, 6)]
Is it possible to get something like this:
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
Here each coordinate list from the previous df is split into separate rows under its respective id.
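For anyone who wants to try the answers below, a small stand-in frame can be built by hand (a sketch: the tuples match the output shown above; no rdp call is needed just to test the reshaping):

```python
import pandas as pd

# Hand-built stand-in for df1, the grouped result shown above.
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})
print(df1)
```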
You can use the DataFrame constructor with stack:
df2 = (pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='x,y'))
print (df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
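To see why the chain works, here is each intermediate step spelled out (a sketch with the sample values rebuilt inline):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]]})

# Step 1: one column per list position, indexed by id -- a "wide" frame of tuples.
wide = pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])

# Step 2: stack() moves the position columns into a second index level,
# giving a long Series indexed by (id, position).
stacked = wide.stack()

# Step 3: drop the position level, then turn id back into a column.
df2 = stacked.reset_index(level=1, drop=True).reset_index(name='x,y')
print(df2)
```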
numpy
Use numpy.repeat with the lengths from str.len; the column x,y comes from flattening the lists with numpy.ndarray.sum:
df2 = pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
print (df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
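The two building blocks can be checked in isolation (a small sketch, using values from the frame above):

```python
import numpy as np
import pandas as pd

# np.repeat repeats each element by its matching count.
ids = np.repeat(np.array([1, 2, 3]), [2, 2, 2])
print(ids)  # -> [1 1 2 2 3 3]

# .values gives an object array of lists; ndarray.sum reduces it
# with +, i.e. concatenates the sub-lists into one flat list.
s = pd.Series([[(0, 0), (1, 2)], [(1, 3), (1, 2)]])
flat = s.values.sum()
print(flat)  # -> [(0, 0), (1, 2), (1, 3), (1, 2)]
```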
Timings
In [54]: %timeit pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id']).stack().reset_index(level=1, drop=True).reset_index(name='x,y')
1000 loops, best of 3: 1.49 ms per loop
In [55]: %timeit pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 562 µs per loop
#piRSquared solution
In [56]: %timeit pd.DataFrame({'id': df1['id'].repeat(df1['x,y'].str.len()), 'x,y': df1['x,y'].sum() })
1000 loops, best of 3: 712 µs per loop
- Calculating the new column 'id': we can use the pandas method str.len to quickly count the number of items in each list. This is convenient because we can pass the result directly to the repeat method of df1['id'], which repeats each element by the corresponding length.
- Calculating the new column 'x,y': generally, I like to use np.concatenate to combine all the sub-lists. However, in this case the sub-lists are lists of tuples, and np.concatenate will not treat them as lists of objects. So instead I use the sum method, which falls back to concatenating the lists with the + operator.
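The two bullets above can be put together step by step on hand-built sample data (a sketch, not the poster's exact session):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]]})

lengths = df1['x,y'].str.len()   # 2, 2, 2 -- items per list
ids = df1['id'].repeat(lengths)  # each id repeated; the original index is kept
flat = df1['x,y'].sum()          # sub-lists concatenated into one flat list

df2 = pd.DataFrame({'id': ids, 'x,y': flat})
print(df2)
```

Note that because repeat keeps the original index, the result carries the duplicated index 0, 0, 1, 1, 2, 2 rather than a fresh RangeIndex.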
pandas
If we stick with pandas we can keep the code cleaner. Use repeat with str.len and sum:
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
id x,y
0 1 (0, 0)
0 1 (1, 2)
1 2 (1, 3)
1 2 (1, 2)
2 3 (2, 5)
2 3 (4, 6)
numpy
We can speed this up by using the underlying numpy arrays and the equivalent numpy methods.
NOTE: this is equivalent logic!
pd.DataFrame({
'id': df1['id'].values.repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()
})
We can speed it up even further by skipping str.len and computing the lengths with a list comprehension.
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
Timing tests
small data
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 351 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 590 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 498 µs per loop
big data
df1 = pd.concat([df1.head(3)] * 100, ignore_index=True)
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 579 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 841 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 704 µs per loop