PySpark flatMap in pandas
Is there an operation in pandas that does the same thing as flatMap in PySpark?
An example of flatMap:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply and then itertools.chain, but I am wondering if there is a one-step solution.
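A minimal sketch of that two-step approach, using the same toy data as the flatMap example above:

import itertools
import pandas as pd

# One input value per row, mirroring the RDD example.
s = pd.Series([2, 3, 4])

# Step 1: apply the per-element function, producing one list per row.
per_row = s.apply(lambda x: list(range(1, x)))

# Step 2: flatten the lists into a single flat Series.
flat = pd.Series(list(itertools.chain.from_iterable(per_row)))

sorted(flat)  # [1, 1, 1, 2, 2, 3]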
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0      1
1      3
2      2
3      4
4    NaN
5      5
dtype: float64
The NaN is introduced because the intermediate object creates a MultiIndex, but for a lot of things you can just drop it:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64
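To see where that NaN comes from, here is a sketch of the intermediate objects this chain builds (outputs shown as comments; exact formatting varies by pandas version):

import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# apply(pd.Series) pads the shorter list with NaN to form a rectangle...
wide = df['x'].apply(pd.Series)
#    0  1    2
# 0  1  2  NaN
# 1  3  4  5.0

# ...and unstack() turns that rectangle into a MultiIndexed Series,
# which is where the stray NaN in the flattened result comes from.
stacked = wide.unstack()
# 0  0    1.0
#    1    3.0
# 1  0    2.0
#    1    4.0
# 2  0    NaN
#    1    5.0
# dtype: float64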
This trick stays entirely within pandas code, so I expect it to be reasonably efficient, though it might not handle things like lists of very different sizes well.
I suspect the answer is "no, not efficiently".
Pandas is not built for nested data like this. I suspect the case you are facing in Pandas looks something like this:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df
Out[3]:
           x
0     [1, 2]
1  [3, 4, 5]
And that you want something like the following:
   x
0  1
0  2
1  3
1  4
1  5
It is much more typical to normalize your data in plain Python before sending it to Pandas. If Pandas did this itself, it would probably only run at slow Python speeds rather than fast C speeds.
Typically you do a bit of munging over the data before handing it off to tabular computation.
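A minimal sketch of that kind of pre-normalization, assuming the same toy data as above (the row column is just illustrative):

import pandas as pd

nested = [[1, 2], [3, 4, 5]]

# Flatten in plain Python first, remembering which row each value came from,
# then hand the already-flat records to pandas.
records = [(i, value)
           for i, values in enumerate(nested)
           for value in values]

df = pd.DataFrame(records, columns=['row', 'x'])
#    row  x
# 0    0  1
# 1    0  2
# 2    1  3
# 3    1  4
# 4    1  5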