Explode array data into rows in Spark

Question

I have a dataset like this:

FieldA    FieldB    ArrayField
1         A         {1,2,3}
2         B         {3,5}

I would like to explode the data on ArrayField so that the result looks like this:

FieldA    FieldB    ExplodedField
1         A         1
1         A         2
1         A         3
2         B         3
2         B         5

What I mean is that I want to create an output row for each element of ArrayField while keeping the values of the other fields.

How would you implement this in Spark? Please note that the input dataset is very large.

+20

apache-spark pyspark

Gluz 08 june 17 at 13:17

3 answers

You can use the explode function. Below is a simple example for your case:

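import org.apache.spark.sql.functions._
import spark.implicits._

// Build a small sample DataFrame; FieldC holds the array to explode
val data = spark.sparkContext.parallelize(Seq(
  (1, "A", List(1, 2, 3)),
  (2, "B", List(3, 5))
)).toDF("FieldA", "FieldB", "FieldC")

// Emit one row per array element, then drop the original array column
data.withColumn("ExplodedField", explode($"FieldC")).drop("FieldC")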

Hope this helps!

+3

Shankar koirala 08 june 17 at 13:28

explode does exactly what you want. Docs:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

Also, here's an example from another question using it:

fooobar.com/questions/2414693 / ...

+1

Ryan Widmaier 08 june 17 at 13:28

rogue-one · Accepted Answer · 2017-06-08T13:27:52+0000

The explode function should accomplish this.

PySpark version:

>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])],["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+

Scala version:

scala> val df = Seq((1, "A", Seq(1,2,3)), (2, "B", Seq(3,5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+
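
One caveat that may matter on a large, messy input: explode produces no output rows for an input row whose array is null or empty, while explode_outer (in Spark versions that provide it) keeps such a row once with a null element. The sketch below models that row-level behaviour in plain Python, without Spark, only to make the semantics concrete. The helper explode_rows and its outer flag are hypothetical names for illustration, not part of any Spark API, and rows 3 and 4 are made-up sample data.

```python
# Plain-Python model of explode's row semantics; this is not Spark code.
# explode_rows is a hypothetical helper used only for illustration.

def explode_rows(rows, outer=False):
    """Yield one (field_a, field_b, element) tuple per array element.

    With outer=False, rows whose array is None or empty yield nothing,
    mirroring Spark's explode. With outer=True, such rows are kept once
    with None as the element, mirroring explode_outer.
    """
    for field_a, field_b, array in rows:
        if not array:  # None or empty list
            if outer:
                yield (field_a, field_b, None)
            continue
        for element in array:
            yield (field_a, field_b, element)


rows = [
    (1, "A", [1, 2, 3]),
    (2, "B", [3, 5]),
    (3, "C", []),    # empty array (made-up sample row)
    (4, "D", None),  # null array (made-up sample row)
]

inner = list(explode_rows(rows))              # rows 3 and 4 disappear
outer = list(explode_rows(rows, outer=True))  # rows 3 and 4 kept with None

print(inner)
print(outer)
```

If rows with missing arrays must survive into the output, that would be a reason to reach for explode_outer instead of explode in the real DataFrame code.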
