Pyspark flatMap in pandas

Is there an operation in pandas that does the same thing as flatMap in pyspark?

An example of a flat map:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

      

So far I can think of apply and then itertools.chain, but I am wondering if there is a one-step solution.
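For reference, the two-step approach mentioned here (apply, then itertools.chain) might look like the minimal sketch below. The Series s and the names mapped and flat are illustrative and do not come from the original question.

```python
from itertools import chain

import pandas as pd

# Same input values as the PySpark example (hypothetical pandas equivalent).
s = pd.Series([2, 3, 4])

# Step 1: map every element to a list (the "map" half of flatMap).
mapped = s.apply(lambda x: list(range(1, x)))

# Step 2: chain the per-element lists into one flat Series (the "flat" half).
flat = pd.Series(list(chain.from_iterable(mapped)))

print(sorted(flat))
```

Sorting the flattened values should reproduce the list shown in the PySpark example.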

+3




3 answers


There is a hack. I often do something like

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0     1
1     3
2     2
3     4
4   NaN
5     5
dtype: float64

      

The NaN is introduced because the intermediate object is rectangular (shorter lists are padded before the MultiIndex is created), but for a lot of things you can just drop it:



In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64

      

This trick uses only pandas code, so I expect it to be reasonably efficient, although it may not cope well with lists of very different sizes, because every row is padded to the length of the longest list.
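As a point of comparison, here is a sketch that avoids the padded intermediate entirely. It is not the approach described in this answer: it assumes pandas 0.25 or later, where Series.explode is available, and the names flat and flat_int are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# Assumption: pandas >= 0.25, which added Series.explode.
# Each list element becomes its own row and the original index label is repeated.
flat = df['x'].explode()

# explode keeps the object dtype of the list column, so cast if integers are needed.
flat_int = flat.astype(int)

print(flat_int)
```

Because no NaN padding is created, no dropna step is needed and the values do not have to pass through float64.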

+2




I suspect the answer is "no, not efficiently".

Pandas is not built for nested data like this. I suspect the case you are looking at in Pandas looks something like this:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df
Out[3]: 
           x
0     [1, 2]
1  [3, 4, 5]

      

And that you want something like the following:



    x
0   1
0   2
1   3
1   4
1   5

      

It is much more typical to normalize your data in Python before sending it to Pandas. If Pandas did it, then it could probably only run at slow Python speeds, not C speeds.

Typically, you do a bit of data munging in plain Python before moving on to tabular computation.
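A minimal sketch of normalizing in Python first, assuming the nested data starts out as a plain list of lists. The names nested, index and data are hypothetical.

```python
import pandas as pd

# Hypothetical raw input: the nested lists before they ever reach pandas.
nested = [[1, 2], [3, 4, 5]]

# Flatten in plain Python, remembering which outer row each value came from.
index = [i for i, values in enumerate(nested) for _ in values]
data = [value for values in nested for value in values]

# Only the already-flat data is handed to pandas.
df = pd.DataFrame({'x': data}, index=index)

print(df)
```

The resulting frame has one value per row, with the outer row number repeated in the index.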

+1




There are three steps to solve this problem.

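import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
# Expand each list into columns, unstack into a long Series, then drop the NaN padding.
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
# level_1 is the original row label; column 0 holds the flattened values.
df_new[['level_1', 0]]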

      


-1








