How can I split a comma string and get n values in a Spark Scala DataFrame?

How can I take only the first 2 elements from an ArrayType column in Spark Scala? I load the data with val df = spark.sqlContext.sql("select col1, col2 from test_tbl").

I have the following data:

col1  | col2                              
---   | ---
a     | [test1,test2,test3,test4,.....]   
b     | [a1,a2,a3,a4,a5,.....]       

      

I want to get the following data:

col1| col2
----|----
a   | test1,test2
b   | a1,a2

      

When I do df.withColumn("test", col("col2").take(5)), it doesn't work. It gives this error:

value take is not a member of org.apache.spark.sql.ColumnName

How can I get the data in the specified order?

+3




2 answers


Inside withColumn you can call a UDF such as getPartialstring, which uses the slice or take method on the array, as in the example below.

  import sqlContext.implicits._
  import org.apache.spark.sql.functions._

  // Joins the elements in the index range [fromIndex, toIndex) with commas.
  val getPartialstring = udf((array: Seq[String], fromIndex: Int, toIndex: Int) =>
    array.slice(fromIndex, toIndex).mkString(","))

      

The call then looks like this (the index arguments must be passed as columns, so they are wrapped in lit):

  df.withColumn("test", getPartialstring(col("col2"), lit(0), lit(2)))

      



col("col2").take(5) fails because Column has no take(..) method, which is why your error message says

error: value take is not a member of org.apache.spark.sql.ColumnName

You can use the UDF approach to solve this problem.
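The UDF approach described above can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: the object name PartialStringExample, the helper partialString, the local SparkSession, and the sample rows standing in for test_tbl are all assumptions made for illustration. The helper also returns null for a null array, which the original UDF does not handle.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

object PartialStringExample {
  // Plain Scala helper (hypothetical name). slice does not throw on short
  // input; it simply returns fewer elements. A null array yields null.
  def partialString(values: Seq[String], fromIndex: Int, toIndex: Int): String =
    Option(values).map(_.slice(fromIndex, toIndex).mkString(",")).orNull

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partial-string-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for test_tbl.
    val df = Seq(
      ("a", Seq("test1", "test2", "test3", "test4")),
      ("b", Seq("a1", "a2", "a3", "a4", "a5"))
    ).toDF("col1", "col2")

    // Wrap the helper in a UDF; every argument is passed as a Column.
    val getPartialstring = udf(
      (values: Seq[String], fromIndex: Int, toIndex: Int) =>
        partialString(values, fromIndex, toIndex))

    // Keep the first two elements of col2 as a comma-separated string.
    df.withColumn("col2", getPartialstring(col("col2"), lit(0), lit(2)))
      .show(false)

    spark.stop()
  }
}
```

With the sample rows above, col2 should come out as test1,test2 for row a and a1,a2 for row b, which matches the output requested in the question.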

+2




You can use the Column apply function to get each individual element up to a specific index, and then build a new array with the array function:

import spark.implicits._
import org.apache.spark.sql.functions._

// Sample data:
val df = Seq(
  ("a", Array("a1", "a2", "a3", "a4", "a5", "a6")),
  ("a", Array("b1", "b2", "b3", "b4", "b5")),
  ("c", Array("c1", "c2"))
).toDF("col1", "col2")

val n = 4
val result = df.withColumn("col2", array((0 until n).map($"col2"(_)): _*))

result.show(false)
// +----+--------------------+
// |col1|col2                |
// +----+--------------------+
// |a   |[a1, a2, a3, a4]    |
// |a   |[b1, b2, b3, b4]    |
// |c   |[c1, c2, null, null]|
// +----+--------------------+

      



Note that this pads the result with null for records whose arrays have fewer than n elements.
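If the null padding is unwanted, one possible alternative (not part of the original answers) is the built-in slice function, assuming Spark 2.4 or later where org.apache.spark.sql.functions.slice is available. The sketch below is illustrative only: the object name SliceExample, the helper firstN, and the sample data are hypothetical. Note that slice uses a 1-based start index.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, slice}

object SliceExample {
  // Hypothetical helper: keeps at most n leading elements of col2.
  // slice(column, start, length) is 1-based and returns only the elements
  // that exist, so shorter arrays are not padded.
  def firstN(df: DataFrame, n: Int): DataFrame =
    df.withColumn("col2", slice(col("col2"), 1, n))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("slice-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, similar to the answer above.
    val df = Seq(
      ("a", Seq("a1", "a2", "a3", "a4", "a5", "a6")),
      ("c", Seq("c1", "c2"))
    ).toDF("col1", "col2")

    val truncated = firstN(df, 4)
    truncated.show(false)

    // Optionally join the kept elements into a comma-separated string.
    truncated.withColumn("col2", concat_ws(",", col("col2"))).show(false)

    spark.stop()
  }
}
```

On Spark versions before 2.4, where slice is not available in functions, the UDF approach remains the option that avoids padding.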

+2








