How can I convert a Dataset[(String, Seq[String])] to a Dataset[(String, String)]?


This is probably a simple problem, but I am just starting my adventure with Spark.

Problem. I want to get the structure shown under "Expected result" below, using Spark. I currently have the following structure:

title1, {word11, word12, word13 ...}
title2, {word12, word22, word23 ...}

The data is stored in a Dataset[(String, Seq[String])].

Expected result. I would like to get tuples of (word, title):

word11, {title1}
word12, {title1}

What I do
1. Make (title, Seq(word1, word2, word3))

docs.mapPartitions { iter =>
  iter.map {
     case (title, contents) => {
        val textToLemmas: Seq[String] = toText(....)
        (title, textToLemmas)
     }
  }
}


  1. I tried to use .map to convert my structure into tuples, but I can't seem to get it to work.
  2. I tried to iterate over all the elements, but then I cannot return the right type.

Thanks in advance for any answers.



3 answers


This should work:



val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
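To see what this flatMap does, here is a sketch on plain Scala collections with sample data assumed for illustration (Dataset.flatMap applies the same per-record expansion, just distributed):

```scala
// Sample input: each title paired with its words.
val docs = Seq(
  ("title1", Seq("word11", "word12", "word13")),
  ("title2", Seq("word12", "word22", "word23"))
)

// Each (title, words) record expands into one (word, title) pair per word.
val inverted = docs.flatMap { case (title, words) => words.map((_, title)) }
// inverted: Seq[(String, String)]
// e.g. ("word11", "title1"), ("word12", "title1"), ..., ("word23", "title2")
```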




Another solution is to call the explode function, like this:

import org.apache.spark.sql.functions.{col, explode}

// explode takes a Column, not a column name, so wrap it with col(...)
dataset.withColumn("_2", explode(col("_2"))).as[(String, String)]




Hope this helps you, best regards.



I'm surprised no one has suggested a solution using a Scala for comprehension (which gets "desugared" into flatMap and map, as in Yuval Itschakov's answer, at compile time).

Whenever you see a chain of flatMap and map (possibly with filter), reach for a Scala for comprehension.

So the following:

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }


is equivalent to the following:

val result = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)
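You can check the equivalence on plain Scala collections, no Spark needed (sample data assumed for illustration):

```scala
// Sample input: titles paired with their words.
val dataSet = Seq(
  ("title1", Seq("word11", "word12")),
  ("title2", Seq("word21", "word22"))
)

// Explicit flatMap/map version.
val viaFlatMap = dataSet.flatMap { case (title, words) => words.map((_, title)) }

// For-comprehension version; the compiler desugars it into the calls above.
val viaFor = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)

// Both produce the same Seq[(String, String)].
```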


After all, isn't that why we love Scala's flexibility?







