How do I split cell values across multiple lines in a pandas dataframe?
I have the following dataframe, which was produced by this code:
df1=df.groupby('id')['x,y'].apply(lambda x: rdp(x.tolist(), 5.0)).reset_index()
Refer here
The resulting data frame:
id x,y
0 1 [(0, 0), (1, 2)]
1 2 [(1, 3), (1, 2)]
2 3 [(2, 5), (4, 6)]
Is it possible to get something like this:
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
Here each coordinate list from the previous df is split into separate rows under its respective id.
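For anyone who wants to try the answers below, a small stand-in frame can be built by hand (a sketch: the tuples match the output shown above; no rdp call is needed just to test the reshaping):

```python
import pandas as pd

# Hand-built stand-in for df1, the grouped result shown above.
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]],
})
print(df1)
```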
You can use the DataFrame constructor with stack:
df2 = (pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='x,y'))
print (df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
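To see why the chain works, here is each intermediate step spelled out (a sketch with the sample values rebuilt inline):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]]})

# Step 1: one column per list position, indexed by id -- a "wide" frame of tuples.
wide = pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id'])

# Step 2: stack() moves the position columns into a second index level,
# giving a long Series indexed by (id, position).
stacked = wide.stack()

# Step 3: drop the position level, then turn id back into a column.
df2 = stacked.reset_index(level=1, drop=True).reset_index(name='x,y')
print(df2)
```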
numpy
Use numpy.repeat with the lengths from str.len; the column x,y comes from flattening the lists with numpy.ndarray.sum:
df2 = pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
print (df2)
id x,y
0 1 (0, 0)
1 1 (1, 2)
2 2 (1, 3)
3 2 (1, 2)
4 3 (2, 5)
5 3 (4, 6)
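The two building blocks can be checked in isolation (a small sketch, using values from the frame above):

```python
import numpy as np
import pandas as pd

# np.repeat repeats each element by its matching count.
ids = np.repeat(np.array([1, 2, 3]), [2, 2, 2])
print(ids)  # -> [1 1 2 2 3 3]

# .values gives an object array of lists; ndarray.sum reduces it
# with +, i.e. concatenates the sub-lists into one flat list.
s = pd.Series([[(0, 0), (1, 2)], [(1, 3), (1, 2)]])
flat = s.values.sum()
print(flat)  # -> [(0, 0), (1, 2), (1, 3), (1, 2)]
```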
Timings
In [54]: %timeit pd.DataFrame(df1['x,y'].values.tolist(), index=df1['id']).stack().reset_index(level=1, drop=True).reset_index(name='x,y')
1000 loops, best of 3: 1.49 ms per loop
In [55]: %timeit pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()), 'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 562 µs per loop
#piRSquared solution
In [56]: %timeit pd.DataFrame({'id': df1['id'].repeat(df1['x,y'].str.len()), 'x,y': df1['x,y'].sum() })
1000 loops, best of 3: 712 µs per loop
- Calculating the new column 'id': we can use the pandas method str.len to quickly count the number of items in each list. This is convenient because we can pass the result directly to the repeat method of df1['id'], which repeats each element by the corresponding length.
- Calculating the new column 'x,y': generally, I like to use np.concatenate to combine all the sub-lists. However, in this case the sub-lists are lists of tuples, and np.concatenate will not treat them as lists of objects. So instead I use the sum method, which falls back to concatenating the lists with the + operator.
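The two bullets above can be put together step by step on hand-built sample data (a sketch, not the poster's exact session):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'x,y': [[(0, 0), (1, 2)], [(1, 3), (1, 2)], [(2, 5), (4, 6)]]})

lengths = df1['x,y'].str.len()   # 2, 2, 2 -- items per list
ids = df1['id'].repeat(lengths)  # each id repeated; the original index is kept
flat = df1['x,y'].sum()          # sub-lists concatenated into one flat list

df2 = pd.DataFrame({'id': ids, 'x,y': flat})
print(df2)
```

Note that because repeat keeps the original index, the result carries the duplicated index 0, 0, 1, 1, 2, 2 rather than a fresh RangeIndex.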
pandas
If we stick with pandas we can keep the code cleaner. Use repeat with str.len and sum:
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
id x,y
0 1 (0, 0)
0 1 (1, 2)
1 2 (1, 3)
1 2 (1, 2)
2 3 (2, 5)
2 3 (4, 6)
numpy
We can speed this up by using the underlying numpy arrays and the equivalent numpy methods.
NOTE: this is equivalent logic!
pd.DataFrame({
'id': df1['id'].values.repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()
})
We can speed it up even further by skipping str.len and computing the lengths with a list comprehension.
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
Timing tests
small data
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 351 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 590 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 498 µs per loop
big data
df1 = pd.concat([df1.head(3)] * 100, ignore_index=True)
%%timeit
pd.DataFrame({
'id': df1['id'].values.repeat([len(w) for w in df1['x,y'].values.tolist()]),
'x,y': df1['x,y'].values.sum()
})
1000 loops, best of 3: 579 µs per loop
%%timeit
pd.DataFrame({
'id': df1['id'].repeat(df1['x,y'].str.len()),
'x,y': df1['x,y'].sum()
})
1000 loops, best of 3: 841 µs per loop
%%timeit
pd.DataFrame({'id': np.repeat(df1['id'].values, df1['x,y'].str.len()),
'x,y': df1['x,y'].values.sum()})
1000 loops, best of 3: 704 µs per loop