How to split RDD data in two in Spark?
I have data in a Spark RDD and I want to split it in two with a ratio such as 0.7. For example, if the RDD looks like this:
[1,2,3,4,5,6,7,8,9,10]
I want to split it into rdd1:
[1,2,3,4,5,6,7]
and rdd2:
[8,9,10]
with a ratio of 0.7. rdd1 and rdd2 must be random every time. I tried this:
import random

scale = 0.7
seed = random.randint(0, 10000)
rdd1 = data.sample(False, scale, seed)  # sample ~70% without replacement
rdd2 = data.subtract(rdd1)
and it sometimes works, but when my data contains dicts I run into problems. For example, with data like this:
[{1:2},{3:1},{5:4,2:6}]
I get
TypeError: unhashable type: 'dict'
Both RDDs and DataFrames provide a randomSplit method that can be used here.

For RDDs:

rdd = sc.parallelize(range(10))
test, train = rdd.randomSplit(weights=[0.3, 0.7], seed=1)
test.collect()
## [4, 7, 8]
train.collect()
## [0, 1, 2, 3, 5, 6, 9]

and for DataFrames:

df = rdd.map(lambda x: (x, )).toDF(["x"])
test, train = df.randomSplit(weights=[0.3, 0.7])
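The idea behind this kind of split, drawing a deterministic random number per element and routing it to one bucket, which amounts to one filter per output, can be sketched in plain Python without Spark. This is an illustrative stand-in, not Spark's actual implementation; all names here are made up:

```python
import random

def random_split(data, weights, seed):
    """Split data into len(weights) disjoint lists, similar in spirit
    to RDD.randomSplit (illustrative sketch only)."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.3, 1.0] for weights [0.3, 0.7].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    def bucket(index):
        # Deterministic per-element draw: the same (seed, index) pair
        # always yields the same value, so repeated "filter" passes agree.
        r = random.Random(seed * 100003 + index).random()
        for k, b in enumerate(bounds):
            if r < b:
                return k
        return len(bounds) - 1

    # One filter pass per output bucket.
    return [[x for i, x in enumerate(data) if bucket(i) == k]
            for k in range(len(weights))]

test, train = random_split(list(range(10)), [0.3, 0.7], seed=1)
```

The two outputs are always disjoint, together cover the input, and are reproducible for a fixed seed, which is exactly the contract you want from a train/test split.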
Notes

- randomSplit is implemented as a single filter for each output RDD. In general, it is impossible to obtain multiple RDDs from a single Spark transformation. See fooobar.com/questions/264696/... for details.
- You cannot use subtract with dictionaries, because internally it is expressed with cogroup, and that requires the objects to be hashable. See also "List as a key for reduceByKey in PySpark".
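If you still need subtract-like behavior on dicts, one workaround (my suggestion, not part of the answer above) is to map each dict to a hashable representation, a tuple of sorted items, do the subtraction, and map back. A plain-Python sketch of the idea:

```python
def to_hashable(d):
    # Dicts are unhashable, but a tuple of sorted (key, value) pairs is.
    return tuple(sorted(d.items()))

def from_hashable(t):
    return dict(t)

data = [{1: 2}, {3: 1}, {5: 4, 2: 6}]
sampled = [{1: 2}]

# Plain-Python stand-in for the Spark version:
#   data_rdd.map(to_hashable).subtract(sampled_rdd.map(to_hashable))
sampled_keys = {to_hashable(d) for d in sampled}
rest = [d for d in data if to_hashable(d) not in sampled_keys]
# rest == [{3: 1}, {5: 4, 2: 6}]
```

In Spark you would apply to_hashable with map before subtract and from_hashable with map afterwards to recover the original dicts. Note this assumes the dict keys are mutually comparable (sorted requires that).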