How to split RDD data in two in Spark?
I have data in a Spark RDD and I want to split it in two with a ratio such as 0.7. For example, if the RDD looks like this:
[1,2,3,4,5,6,7,8,9,10]
I want to split it into rdd1:
[1,2,3,4,5,6,7]
and rdd2:
[8,9,10]
with a ratio of 0.7. rdd1 and rdd2 must be random every time. I tried this:
import random

scale = 0.7
seed = random.randint(0, 10000)
rdd1 = data.sample(False, scale, seed)  # sample ~70% without replacement
rdd2 = data.subtract(rdd1)
and it sometimes works, but when my data contains dicts I run into problems. For example, with data like this:
[{1:2},{3:1},{5:4,2:6}]
I get
TypeError: unhashable type: 'dict'
Both RDDs and DataFrames provide a randomSplit method that can be used here.

For RDDs:

rdd = sc.parallelize(range(10))
test, train = rdd.randomSplit(weights=[0.3, 0.7], seed=1)
test.collect()
## [4, 7, 8]
train.collect()
## [0, 1, 2, 3, 5, 6, 9]

and for DataFrames:

df = rdd.map(lambda x: (x, )).toDF(["x"])
test, train = df.randomSplit(weights=[0.3, 0.7])
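The idea behind this kind of split, drawing a deterministic random number per element and routing it to one bucket, which amounts to one filter per output, can be sketched in plain Python without Spark. This is an illustrative stand-in, not Spark's actual implementation; all names here are made up:

```python
import random

def random_split(data, weights, seed):
    """Split data into len(weights) disjoint lists, similar in spirit
    to RDD.randomSplit (illustrative sketch only)."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.3, 1.0] for weights [0.3, 0.7].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    def bucket(index):
        # Deterministic per-element draw: the same (seed, index) pair
        # always yields the same value, so repeated "filter" passes agree.
        r = random.Random(seed * 100003 + index).random()
        for k, b in enumerate(bounds):
            if r < b:
                return k
        return len(bounds) - 1

    # One filter pass per output bucket.
    return [[x for i, x in enumerate(data) if bucket(i) == k]
            for k in range(len(weights))]

test, train = random_split(list(range(10)), [0.3, 0.7], seed=1)
```

The two outputs are always disjoint, together cover the input, and are reproducible for a fixed seed, which is exactly the contract you want from a train/test split.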
Notes

- randomSplit is implemented as a single filter for each output RDD. In general, it is impossible to obtain multiple RDDs from a single Spark transformation. See fooobar.com/questions/264696/... for details.
- You cannot use subtract with dictionaries, because internally it is expressed with cogroup, and that requires the objects to be hashable. See also "List as a key for reduceByKey in PySpark".
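If you still need subtract-like behavior on dicts, one workaround (my suggestion, not part of the answer above) is to map each dict to a hashable representation, a tuple of sorted items, do the subtraction, and map back. A plain-Python sketch of the idea:

```python
def to_hashable(d):
    # Dicts are unhashable, but a tuple of sorted (key, value) pairs is.
    return tuple(sorted(d.items()))

def from_hashable(t):
    return dict(t)

data = [{1: 2}, {3: 1}, {5: 4, 2: 6}]
sampled = [{1: 2}]

# Plain-Python stand-in for the Spark version:
#   data_rdd.map(to_hashable).subtract(sampled_rdd.map(to_hashable))
sampled_keys = {to_hashable(d) for d in sampled}
rest = [d for d in data if to_hashable(d) not in sampled_keys]
# rest == [{3: 1}, {5: 4, 2: 6}]
```

In Spark you would apply to_hashable with map before subtract and from_hashable with map afterwards to recover the original dicts. Note this assumes the dict keys are mutually comparable (sorted requires that).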