Spark RDD Deduplication to Increase RDD

Question

Spark RDD Deduplication to Increase RDD

I have a dataframe loaded from disk

df_ = sqlContext.read.json("/Users/spark_stats/test.json")

It contains 500k lines.
my script works fine for this size, but I want to test it, for example on 5mm lines, is there a way to duplicate df 9 times? (it doesn't matter to me duplicates in df)

I already use union but it is very slow (as I think it reads from disk all the time)

df = df_
for i in range(9): 
    df = df.union(df_)

Do you have any idea on how to do this?

thank

0

duplicates union pyspark

test.beta 07 June 17 at 15:51

source to share

1 answer

RyanW · Answer 1 · 2017-06-07T16:53:07+0000

You can use an explosion. It should only read from the raw disk:

from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = StructType([StructField("f1", StringType()), StructField("f2", StringType())])

data = [("a", "b"), ("c", "d")]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)

# Create an array with as many values as times you want to duplicate the rows
dups_array = [lit(i) for i in xrange(9)]
duplicated = df.withColumn("duplicate", array(*dups_array)) \
               .withColumn("duplicate", explode("duplicate")) \
               .drop("duplicate")

Spark RDD Deduplication to Increase RDD

More articles: