Does randomSplit return a copy of or a reference to the original RDD?

Suppose I have something like the code below:

for idx in range(0, 10):
    train_test_split = training.randomSplit(weights=[0.75, 0.25])
    train_cv = train_test_split[0]
    test_cv = train_test_split[1]
    # scale train_cv and test_cv


By scaling train_cv and test_cv, will the original data be affected?


RDDs are immutable.

Transformations such as randomSplit and map never modify an RDD in place; each one returns new RDDs derived from the original. So no, scaling train_cv and test_cv will not affect the original data.



