How can I convert a Dataset[(String, Seq[String])] to a Dataset[(String, String)]?


This is probably a simple problem, but I am just starting my adventure with Spark.

Problem. I want to get the structure shown under "Expected result" below, using Spark. I currently have the following structure:

title1, {word11, word12, word13 ...}
title2, {word12, word22, word23 ...}

The data is stored in a Dataset[(String, Seq[String])].

Expected result. I would like to get tuples of (word, title):

word11, {title1}
word12, {title1}

What I do
1. Make (title, Seq(word1, word2, word3))

docs.mapPartitions { iter =>
  iter.map {
     case (title, contents) => {
        val textToLemmas: Seq[String] = toText(....)
        (title, textToLemmas)
     }
  }
}


  1. I tried to use .map to convert my structure into tuples, but I can't seem to get it to work.
  2. I tried to iterate over all the elements, but then I cannot return the right type.

Thanks in advance for any answers.



3 answers


This should work:



val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
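To see what this flatMap does, here is a sketch on plain Scala collections with sample data assumed for illustration (Dataset.flatMap applies the same per-record expansion, just distributed):

```scala
// Sample input: each title paired with its words.
val docs = Seq(
  ("title1", Seq("word11", "word12", "word13")),
  ("title2", Seq("word12", "word22", "word23"))
)

// Each (title, words) record expands into one (word, title) pair per word.
val inverted = docs.flatMap { case (title, words) => words.map((_, title)) }
// inverted: Seq[(String, String)]
// e.g. ("word11", "title1"), ("word12", "title1"), ..., ("word23", "title2")
```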




Another solution is to call the explode function, like this:

import org.apache.spark.sql.functions.{col, explode}

// explode takes a Column, not a column name, so wrap it with col(...)
dataset.withColumn("_2", explode(col("_2"))).as[(String, String)]




Hope this helps you, best regards.



I'm surprised no one has suggested a solution using a Scala for comprehension (which gets "desugared" into flatMap and map, as in Yuval Itschakov's answer, at compile time).

Whenever you see a chain of flatMap and map (possibly with filter), reach for a Scala for comprehension.

So the following:

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }


is equivalent to the following:

val result = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)
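You can check the equivalence on plain Scala collections, no Spark needed (sample data assumed for illustration):

```scala
// Sample input: titles paired with their words.
val dataSet = Seq(
  ("title1", Seq("word11", "word12")),
  ("title2", Seq("word21", "word22"))
)

// Explicit flatMap/map version.
val viaFlatMap = dataSet.flatMap { case (title, words) => words.map((_, title)) }

// For-comprehension version; the compiler desugars it into the calls above.
val viaFor = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)

// Both produce the same Seq[(String, String)].
```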


After all, isn't that why we love Scala's flexibility?







