Scala MapReduce Framework giving type mismatch

I have a MapReduce framework in Scala that is based on several org.apache.hadoop libraries. It works great with a simple word-count program. However, when I tried to apply it to something more useful, I hit a roadblock. I want to take a csv file (or any delimited file, really), pass everything in the first column as a key, and then count the frequency of the keys.

The code looks like this:

class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    line.split(",", -1)(0) foreach (context.write(_, 1))  // splits the data
  }
}

The problem is in the line.split call. When I try to compile it, I get this error:

found   : Char
required: org.apache.hadoop.io.Text

line.split(",", -1)(0) should return a String, which is what gets passed as _ in the (_, 1) pair, but for some reason the compiler thinks it is a Char. I even added .toString to explicitly make it a String, but that didn't work either.

Any ideas are appreciated. Let me know what additional details I can provide.

Update:

Here is the list of imports:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Reducer, Job, Mapper}
import org.apache.hadoop.conf.{Configured}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConversions._
import org.apache.hadoop.util.{ToolRunner, Tool}

Here is the build.sbt code:

import AssemblyKeys._ // put this at the top of the file

assemblySettings

organization := "scala"

name := "WordCount"

version := "1.0"

scalaVersion:= "2.11.2"

scalacOptions ++= Seq("-no-specialization", "-deprecation")

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "1.2.1",
  "org.apache.hadoop" % "hadoop-core" % "latest.integration" exclude ("hadoop-core", "org/apache/hadoop/hdfs/protocol/ClientDatanodeProtocol.class"),
  "org.apache.hadoop" % "hadoop-common" % "2.5.1",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.5.1",
  "commons-configuration" % "commons-configuration" % "1.9",
  "org.apache.hadoop" % "hadoop-hdfs" % "latest.integration")


jarName in assembly := "WordCount.jar"

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case s if s.endsWith(".class")      => MergeStrategy.last
    case s if s.endsWith(".xsd")        => MergeStrategy.last
    case s if s.endsWith(".dtd")        => MergeStrategy.last
    case s if s.endsWith(".xml")        => MergeStrategy.last
    case s if s.endsWith(".properties") => MergeStrategy.last
    case x => old(x)
  }
}

2 answers


I actually solved this by not using the underscore notation and instead specifying the value directly in context.write. So instead of:

line.split(",", -1)(0) foreach (context.write(_,1))

I used:

context.write(line.split(",", -1)(0), 1)

I found a post online saying that Scala sometimes gets confused about datatypes when using _, and recommending that you specify the value explicitly in place. I'm not sure whether that is true, but in this case it solved the problem.
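
For completeness, here is a minimal sketch of the whole mapper with that fix applied. It assumes, as in the question, that HImplicits supplies implicit conversions from String to Text and from Int to LongWritable; if it does not, wrap the values explicitly as shown in the comment:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text,
                             context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    // Emit the first comma-separated field of the line as the key, with a count of 1.
    // HImplicits is assumed to convert String -> Text and Int -> LongWritable;
    // without it, this would be: context.write(new Text(key), new LongWritable(1L))
    val key = line.toString.split(",", -1)(0)
    context.write(key, 1)
  }
}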

I'm guessing line is implicitly converted to a String here (thanks, HImplicits?). Then we have

line.split(",", -1)(0) foreach somethingOrOther

which does the following:

  • split the string into multiple substrings - .split(...)
  • take the zeroth of those substrings - (0)
  • then run somethingOrOther over the characters of that substring - foreach

That is how you end up with a Char.
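
You can see the same behavior in a plain Scala REPL, with no Hadoop involved (a minimal illustration):

// A String is a sequence of Chars, so foreach hands the function one Char at a time.
val first = "hello,world".split(",", -1)(0)  // "hello"
first foreach (c => println(c))              // prints h, e, l, l, o - each c is a Char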
