How do I extract a portion of a string in an RDD?
After a few transformations, this is the RDD output I have:
( z287570731_serv80i:7:175 , 5:Re )
( p286274731_serv80i:6:100 , 138 )
( t219420679_serv37i:2:50 , 5 )
( v290380588_serv81i:12:800 , 144:Jo )
( z292902510_serv83i:4:45 , 5:Re )
Using this data as an input RDD, I would like to extract the value between the two colons.
For example:
Input = ( z287570731_serv80i:7:175 , 5:Re )
Output = 7 (:7:)
This is how I tried to do it:
val processedRDD = tid.map{
case (inString, inInt) =>
val RegEx = """.*:([\d.]+):.*""".r
val table_level = RegEx.findFirstIn(inString)
}
processedRDD.collect().foreach(println)
This is the output I am getting:
() () () () () () ()
How do I do this the Spark way?
There are very good answers here, but I think one is missing that can easily beat them all :) And that's why I love Scala: for its flexibility.
Solution
scala> val solution = rdd.
map { case (left, right) => left }.
map(_.split(":")).
map { case Array(_, takeMe, _) => takeMe }.
collect
solution: Array[String] = Array(7, 6, 2, 12, 4)
I believe this solution is hard to beat for readability and ease of understanding. It simply says what it does (like a good poem).
Explanation
Below is your RDD (nicely formatted thanks to Spark SQL's Dataset.show).
scala> rdd.toDF.show(false)
+-------------------------+------+
|_1 |_2 |
+-------------------------+------+
|z287570731_serv80i:7:175 |5:Re |
|p286274731_serv80i:6:100 |138 |
|t219420679_serv37i:2:50 |5 |
|v290380588_serv81i:12:800|144:Jo|
|z292902510_serv83i:4:45 |5:Re |
+-------------------------+------+
// Compare to this assembler-like way and you understand why you should use Spark SQL for this
scala> rdd.foreach(println)
(z287570731_serv80i:7:175,5:Re)
(p286274731_serv80i:6:100,138)
(t219420679_serv37i:2:50,5)
(v290380588_serv81i:12:800,144:Jo)
(z292902510_serv83i:4:45,5:Re)
The first step is to drop the right column. Pattern matching FTW!
scala> rdd.map { case (left, right) => left }.foreach(println)
z292902510_serv83i:4:45
t219420679_serv37i:2:50
v290380588_serv81i:12:800
p286274731_serv80i:6:100
z287570731_serv80i:7:175
With the temporary RDD, you split each line using : as the delimiter and take the second field. Again, Scala pattern matching FTW!
val oneColumnOnly = rdd.map { case (left, right) => left }
scala> oneColumnOnly.
map(_.split(":")). // <-- split
map { case Array(_, takeMe, _) => takeMe }. // <-- take the 2nd field
foreach(println)
6
12
4
2
7
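One caveat with the Array(_, takeMe, _) pattern: it matches only when the split yields exactly three fields, so a row with a different number of colons would fail with a scala.MatchError inside map. Below is a minimal sketch of a more forgiving variant, assuming malformed rows should simply be skipped. The sample data, including the malformed row, is made up for illustration, and a plain Seq stands in for the RDD so the snippet runs in a Scala REPL without Spark; RDD also offers a collect that takes a partial function.

```scala
// A plain Seq stands in for the RDD here (assumption, so no Spark is needed).
val rows = Seq(
  ("z287570731_serv80i:7:175", "5:Re"),
  ("p286274731_serv80i:6:100", "138"),
  ("malformed_row", "5") // hypothetical row without any colons
)

// collect with a partial function keeps only the rows that match the
// three-field pattern, instead of throwing a MatchError as map would.
val levels = rows.
  map { case (left, _) => left.split(":") }.
  collect { case Array(_, takeMe, _) => takeMe }
```

Whether to skip such rows or fail loudly depends on how much you trust the upstream transformations; the original map version is the stricter choice.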
The value of a block expression in {} is the value of the last expression in the block.
The last line of the pattern match in your map call is val table_level = ..., which is a definition, so the block evaluates to (), of type Unit.
Instead of assigning the result to a val, just write the expression itself as the last line:
val processedRDD = tid.map{
case (inString, inInt) =>
val RegEx = """.*:([\d.]+):.*""".r
RegEx.findFirstIn(inString)
}
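Note that findFirstIn returns the whole matched text wrapped in an Option, and because this pattern starts and ends with .*, the match is the entire input string rather than just the digits. Here is a sketch of one way to return only the captured group, assuming the value of interest is always a run of digits between two colons. The helper name tableLevel and the sample rows are hypothetical, and a plain Seq stands in for the RDD so the snippet runs in a Scala REPL.

```scala
// Pattern with a single capturing group: digits enclosed by colons.
val LevelRegex = """:(\d+):""".r

// Hypothetical helper: the captured digits, or None when nothing matches.
def tableLevel(inString: String): Option[String] =
  LevelRegex.findFirstMatchIn(inString).map(_.group(1))

// The last expression of the case body is its value, so map returns it directly.
val sample = Seq(("z287570731_serv80i:7:175", "5:Re"), ("no_colons_here", "5"))
val processed = sample.map { case (inString, _) => tableLevel(inString) }
```

Keeping the result as an Option makes non-matching rows explicit; flatMap could be used instead of map if they should be dropped.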
You can split the first element of the tuple on : if it always has this structure, and then apply another map to get the desired result.
val rdd = sc.parallelize(Array(( "z287570731_serv80i:7:175" , "5:Re" ),
( "p286274731_serv80i:6:100" , "138" ),
( "t219420679_serv37i:2:50" , "5" ),
( "v290380588_serv81i:12:800" , "144:Jo" ),
( "z292902510_serv83i:4:45" , "5:Re" ) ))
val mapped = rdd.map( x => x._1.split(":")(1) ).map( x => ":"+x+":")
mapped.collect()
res1: Array[String] = Array(:7:, :6:, :2:, :12:, :4:)
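Indexing the split result with (1) throws an ArrayIndexOutOfBoundsException when a key contains no colon. Below is a minimal sketch of a guarded variant, assuming such rows should be dropped. The helper name secondField and the sample rows are hypothetical, and a plain Seq is used in place of the RDD so the snippet runs in a Scala REPL.

```scala
// Hypothetical helper: the second colon-separated field, if there is one.
def secondField(key: String): Option[String] = {
  val parts = key.split(":")
  if (parts.length > 1) Some(parts(1)) else None
}

val pairs = Seq(("v290380588_serv81i:12:800", "144:Jo"), ("no_colon", "5"))

// flatMap drops the None results and unwraps the Some values.
val fields = pairs.flatMap { case (key, _) => secondField(key) }
```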