Spark: convert a string column to an array column

How do I convert a column that was read as a string into a column of arrays? That is, the conversion from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+

To:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns and I want to specify some of them in this format. I am currently reading it in PySpark as below:

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)


Thanks.

+9




3 answers


There are various methods.

The best way to do this is to use the split function and cast the result to array<long>:

import org.apache.spark.sql.functions.{col, split}

data.withColumn("b", split(col("b"), ",").cast("array<long>"))




You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf

val tolong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", tolong(data("b"))).show
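Since the question asks for both Scala and Python, a rough PySpark equivalent of this UDF approach could look like the following (a sketch, assuming the same DataFrame data with a string column b):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, LongType

# split on commas and convert each piece to a long, mirroring the Scala UDF above
to_long = udf(lambda value: [int(x) for x in value.split(",")], ArrayType(LongType()))

data.withColumn("newB", to_long(col("b"))).show()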


Hope this helps!

+11




Using a UDF will give you the exact schema you require, like this:

import org.apache.spark.sql.functions.{col, udf}

val toArray = udf((b: String) => b.split(",").map(_.toLong))

val test1 = test.withColumn("b", toArray(col("b")))


This will give you a schema like this:



scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

As far as applying the schema while actually reading the file, I find that a tricky task. So, for now, you can apply the transformation after creating the DataFrame from test.
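If you need to do this for many columns right after the read (the question mentions ~450 columns, only some of which need converting), one practical option is to loop over the affected columns and apply the split-and-cast transform to each. A minimal PySpark sketch, where array_cols is a hypothetical list of the comma-separated columns:

from pyspark.sql.functions import col, split

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

array_cols = ['b', 'c']  # hypothetical names of the comma-separated columns
for name in array_cols:
    df = df.withColumn(name, split(col(name), ',').cast('array<long>'))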

Hope this helps!

+2


source


In Python (PySpark) this would be:

from pyspark.sql.functions import col, split

test = test.withColumn(
    "b",
    split(col("b"), r",\s*").cast("array<int>")
)
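Note that the target schema in the question uses long elements rather than int; the same pattern with "array<long>" should give the matching element type:

test = test.withColumn("b", split(col("b"), r",\s*").cast("array<long>"))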


0








