Scala n-grams: converting the output to a Set

def ngrams(n: Int, words: Array[String]) = {
  // build the 1-grams up to the n-grams and concatenate them into one Stream
  (1 to n).map { i => words.sliding(i).toStream }
    .foldLeft(Stream[Array[String]]()) {
      (a, b) => a #::: b
    }
}
scala> val op2 = ngrams(3, "how are you".split(" ")).foreach { x => println(x.mkString(" ")) }
Output:
how
are
you
how are
are you
how are you
op2: Unit = ()

      

How can I avoid the Unit value above? I actually want to convert the n-grams to a Set, but because op2 is Unit = (), that doesn't work. Can you please help me work out how to get a Set such as (how, you, how you, how you)? Thanks for the post "How to generate n-grams in scala?" ...


2 answers


This is the type signature for op2. You could either:

  • remove the assignment to op2:

ngrams(3, "how are you".split(" ")).foreach { x => println((x.mkString(" ")))}



  • change .foreach to .map and evaluate op2 to see the result:

scala> val op2 = ngrams(3, "how are you".split(" ")).map { x => x.mkString(" ")}.toList

scala> op2



The short answer is that the return type of foreach is Unit. So when you assign the output of foreach to op2, the type of op2 is Unit and its value is ().
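As a minimal sketch of that difference (the object name ForeachVsMap and its fields are made up for illustration), foreach only runs a side effect and yields Unit, while map builds a new collection that can be stored:

```scala
object ForeachVsMap {
  val words: Array[String] = "how are you".split(" ")

  // foreach runs the side effect for each element and returns Unit,
  // so there is nothing useful to keep in `printed`.
  val printed: Unit = words.foreach(w => println(w))

  // map collects the result of the function for each element.
  val upper: List[String] = words.map(_.toUpperCase).toList

  def main(args: Array[String]): Unit = {
    println(upper) // the mapped values are still available here
  }
}
```

This is why swapping foreach for map (or moving the println out of the assignment) gives op2 a usable value.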

It sounds like you want to do the following:

  • calculate the n-grams using the ngrams method,
  • save the Set of n-grams in op2, and
  • print all the n-grams.

Let's start with the type of the ngrams method:

(n: Int, words: Array[String]) => Stream[Array[String]]

      

It returns a Stream, which looks like it can easily be turned into a Set with toSet:



ngrams(3, "how are you".split(" ")).toSet

However, this is dangerous because in Scala, Array equality is by reference, so a Set of Arrays will not treat two Arrays with the same contents as duplicates. It is much safer to turn your Stream[Array[String]] into a Stream[List[String]] so that all duplicates are removed (this assumes that order matters within each n-gram):

val op2 = ngrams(3, "how are you".split(" ")).map(_.toList).toSet
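To see why the Array version is risky, here is a small sketch (the object name ArrayEquality and its fields are made up for illustration) comparing two Arrays with the same contents against the equivalent Lists:

```scala
object ArrayEquality {
  val a: Array[String] = Array("how", "are")
  val b: Array[String] = Array("how", "are")

  // Arrays use reference equality, so two distinct instances are never ==.
  val arraysEqual: Boolean = a == b

  // Lists compare element by element.
  val listsEqual: Boolean = a.toList == b.toList

  // A Set relies on equals/hashCode, so it keeps both Arrays
  // but collapses the two equal Lists into one entry.
  val arraySetSize: Int = Set(a, b).size
  val listSetSize: Int = Set(a.toList, b.toList).size

  def main(args: Array[String]): Unit = {
    println(s"arrays equal: $arraysEqual, lists equal: $listsEqual")
    println(s"set of arrays: $arraySetSize, set of lists: $listSetSize")
  }
}
```

With an input that repeats a word, such as "you you", a Set[Array[String]] would keep both copies of the 1-gram, which is the duplication that the conversion to List avoids.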

It is now easy to print the Set[List[String]], just as you did with the Stream[Array[String]]:

op2.foreach { x => println((x.mkString(" ")))}

Since the result, (), is of type Unit, there is no reason to assign it to a variable.
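Putting the steps together, one possible variant is sketched below. It is not the ngrams method from the question: ngramSet is a hypothetical helper that uses flatMap instead of foldLeft over Streams and returns the Set directly. Since a Set has no guaranteed iteration order, the sketch sorts the strings before printing.

```scala
object NgramSet {
  // Hypothetical helper: all 1-grams up to n-grams as a Set of Lists,
  // so that n-grams with equal contents are stored only once.
  def ngramSet(n: Int, words: Seq[String]): Set[List[String]] =
    (1 to n).flatMap(i => words.sliding(i).map(_.toList)).toSet

  def main(args: Array[String]): Unit = {
    val op2 = ngramSet(3, "how are you".split(" ").toSeq)
    // Printing is a side effect, so its Unit result is not assigned.
    op2.map(_.mkString(" ")).toList.sorted.foreach(println)
  }
}
```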
