Spark: convert string column to array
How do I convert a column that was read as a string to a column of arrays? That is, a conversion from the schema below:
scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+
To:
scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+
Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns, and I want to specify some of them in this format. I am currently reading the file in PySpark as below:
df = spark.read.format('com.databricks.spark.csv').options(
header='true', inferschema='true', delimiter='|').load(input_file)
Thanks.
There are various methods.

The best way to do this is to use the split function and cast the result to array<long>:

import org.apache.spark.sql.functions.{col, split}

// Split "2,3" on commas into array<string>, then cast to array<long>
data.withColumn("b", split(col("b"), ",").cast("array<long>"))
You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf

// Split on commas and convert each piece to Long
val tolong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", tolong(data("b"))).show
Hope this helps!
Using a UDF will give you the exact schema you require, like this:

import org.apache.spark.sql.functions.{col, udf}

// Split the comma-separated string and convert each element to Long
val toArray = udf((b: String) => b.split(",").map(_.toLong))
val test1 = test.withColumn("b", toArray(col("b")))
This will give you a schema like this:
scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+
As for applying the schema while reading the file itself, I find that a tricky task: CSV schema inference only produces primitive types, not arrays. So, for now, you can apply the transform after creating the DataFrame from test.
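For the ~450-column case from the question, a sketch of that read-then-transform approach in PySpark (the column list array_cols is hypothetical; adjust it to the columns that actually hold comma-separated longs):

from pyspark.sql.functions import col, split

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

# Hypothetical: the subset of the ~450 columns holding "2,3"-style values
array_cols = ['b', 'c']
for name in array_cols:
    df = df.withColumn(name, split(col(name), ',').cast('array<long>'))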
Hope this helps!