How can I convert Dataset[(String, Seq[String])] to Dataset[(String, String)]?
This is probably a simple problem, but I am just starting out with Spark.
Problem: I want to get a particular structure (the expected result, shown further down) in Spark. Right now I have the following structure:
title1, {word11, word12, word13 ...}
title2, {word12, word22, word23 ...}
The data is stored in a Dataset[(String, Seq[String])].
Expected result: I would like to get tuples of (word, title):
word11, {title1}
word12, {title1}
What I did
1. Built (title, Seq[word1, word2, word3])
docs.mapPartitions { iter =>
  iter.map {
    case (title, contents) => {
      val textToLemmas: Seq[String] = toText(....)
      (title, textToLemmas)
    }
  }
}
- I tried to use .map to convert my structure to tuples, but I could not get it to work.
- I tried to iterate over all the elements, but then I could not return the required type.
Thanks in advance for any answers.
I'm surprised no one has suggested a solution using a Scala for-comprehension (which is "desugared" at compile time into flatMap and map, as in Yuval Itschakov's answer).
Whenever you see a chain of flatMap and map (possibly with filter), it is a candidate for a Scala for-comprehension.
So the following:
val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
is equivalent to the following:
val result = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)
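To check the equivalence without a Spark cluster, here is a minimal sketch on plain Scala collections. It assumes a Seq as a stand-in for the Dataset; the object name and the sample titles and words are made up for illustration. On a real Dataset the same shape should apply, but flatMap there also needs an implicit Encoder for the result type in scope (typically via import spark.implicits._).

```scala
object ForComprehensionSketch {
  // Hypothetical sample data; a plain Seq stands in for Dataset[(String, Seq[String])].
  val docs: Seq[(String, Seq[String])] = Seq(
    "title1" -> Seq("word11", "word12"),
    "title2" -> Seq("word21", "word22")
  )

  // Explicit flatMap/map chain: one (word, title) pair per word.
  val viaFlatMap: Seq[(String, String)] =
    docs.flatMap { case (title, words) => words.map(w => (w, title)) }

  // The same pipeline written as a for-comprehension.
  val viaFor: Seq[(String, String)] =
    for {
      (title, words) <- docs
      w <- words
    } yield (w, title)

  // A guard adds a filtering step to the chain; here only words ending in "2" are kept.
  val viaForFiltered: Seq[(String, String)] =
    for {
      (title, words) <- docs
      w <- words
      if w.endsWith("2")
    } yield (w, title)

  def main(args: Array[String]): Unit = {
    println(viaFlatMap == viaFor) // both forms yield the same pairs
    viaFor.foreach(println)
    viaForFiltered.foreach(println)
  }
}
```

Both forms produce the same sequence of (word, title) pairs, so which one to use is mostly a matter of readability.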
After all, that is why we like Scala's flexibility, right?