PySpark flatMap in pandas
Is there an operation in pandas that does the same thing as flatMap in PySpark?
An example of flatMap:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply and then itertools.chain, but I am wondering if there is a one-step solution.
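A minimal sketch of that two-step approach, using the same toy data as the flatMap example above:

import itertools
import pandas as pd

# One input value per row, mirroring the RDD example.
s = pd.Series([2, 3, 4])

# Step 1: apply the per-element function, producing one list per row.
per_row = s.apply(lambda x: list(range(1, x)))

# Step 2: flatten the lists into a single flat Series.
flat = pd.Series(list(itertools.chain.from_iterable(per_row)))

sorted(flat)  # [1, 1, 1, 2, 2, 3]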
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0      1
1      3
2      2
3      4
4    NaN
5      5
dtype: float64
The NaN is introduced because the intermediate object creates a MultiIndex, but for a lot of things you can just drop it:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64
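To see where that NaN comes from, here is a sketch of the intermediate objects this chain builds (outputs shown as comments; exact formatting varies by pandas version):

import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# apply(pd.Series) pads the shorter list with NaN to form a rectangle...
wide = df['x'].apply(pd.Series)
#    0  1    2
# 0  1  2  NaN
# 1  3  4  5.0

# ...and unstack() turns that rectangle into a MultiIndexed Series,
# which is where the stray NaN in the flattened result comes from.
stacked = wide.unstack()
# 0  0    1.0
#    1    3.0
# 1  0    2.0
#    1    4.0
# 2  0    NaN
#    1    5.0
# dtype: float64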
This trick stays entirely within pandas code, so I expect it to be reasonably efficient, though it might not handle things like lists of very different sizes well.
I suspect the answer is "no, not efficiently".
Pandas is not built for nested data like this. I suspect the case you are facing in Pandas looks something like this:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df
Out[3]:
           x
0     [1, 2]
1  [3, 4, 5]
And that you want something like the following:
   x
0  1
0  2
1  3
1  4
1  5
It is much more typical to normalize your data in plain Python before sending it to Pandas. If Pandas did this itself, it would probably only run at slow Python speeds rather than fast C speeds.
Typically you do a bit of munging over the data before handing it off to tabular computation.
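A minimal sketch of that kind of pre-normalization, assuming the same toy data as above (the row column is just illustrative):

import pandas as pd

nested = [[1, 2], [3, 4, 5]]

# Flatten in plain Python first, remembering which row each value came from,
# then hand the already-flat records to pandas.
records = [(i, value)
           for i, values in enumerate(nested)
           for value in values]

df = pd.DataFrame(records, columns=['row', 'x'])
#    row  x
# 0    0  1
# 1    0  2
# 2    1  3
# 3    1  4
# 4    1  5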