Scala MapReduce Framework giving type mismatch
I have a MapReduce framework in Scala that is based on several org.apache.hadoop libraries. It works fine with a simple word-count program. However, when I try to apply it to something more useful I hit a roadblock: I want to take a CSV file (really, any delimited file), use everything in the first column as a key, and then count the frequency of the keys.
The mapper looks like this:
class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    line.split(",", -1)(0) foreach (context.write(_, 1)) // splits the line and writes the first field
  }
}
The problem occurs in the 'line.split' line. When I try to compile it, I get this error:
found: Char
required: org.apache.hadoop.io.Text
line.split(",", -1)(0) should return a String that is passed as _ in the (_, 1) pair, but for some reason the compiler thinks it is a Char. I even added .toString to explicitly make it a String, but that didn't work either.
Any ideas are appreciated. Let me know what additional details I can provide.
Update:
Here is the list of imports:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Reducer, Job, Mapper}
import org.apache.hadoop.conf.{Configured}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConversions._
import org.apache.hadoop.util.{ToolRunner, Tool}
Here is the build.sbt code:
import AssemblyKeys._ // put this at the top of the file
assemblySettings
organization := "scala"
name := "WordCount"
version := "1.0"
scalaVersion:= "2.11.2"
scalacOptions ++= Seq("-no-specialization", "-deprecation")
libraryDependencies ++= Seq("org.apache.hadoop" % "hadoop-client" % "1.2.1",
"org.apache.hadoop" % "hadoop-core" % "latest.integration" exclude ("hadoop-core", "org/apache/hadoop/hdfs/protocol/ClientDatanodeProtocol.class") ,
"org.apache.hadoop" % "hadoop-common" % "2.5.1",
"org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.5.1",
"commons-configuration" % "commons-configuration" % "1.9",
"org.apache.hadoop" % "hadoop-hdfs" % "latest.integration")
jarName in assembly := "WordCount.jar"
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{case s if s.endsWith(".class") => MergeStrategy.last
case s if s.endsWith(".xsd") => MergeStrategy.last
case s if s.endsWith(".dtd") => MergeStrategy.last
case s if s.endsWith(".xml") => MergeStrategy.last
case s if s.endsWith(".properties") => MergeStrategy.last
case x => old(x)
}
}
I actually solved this by not using the underscore notation and instead specifying the value directly in context.write. So instead of:
line.split(",", -1)(0) foreach (context.write(_,1))
I used:
context.write(line.split(",", -1)(0), 1)
I found a post online saying that Scala sometimes gets confused about datatypes when using _, and recommending that the value be specified explicitly instead. I'm not sure whether that is the real explanation, but in this case it solved the problem.
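The difference can be reproduced in plain Scala without any Hadoop dependencies; `write` below is a hypothetical stand-in for `context.write`:

```scala
object UnderscoreDemo extends App {
  // Hypothetical stand-in for context.write(key, value).
  def write(key: String, value: Int): String = s"$key -> $value"

  val line = "fruit,apple,banana"

  // foreach iterates over the Chars of the first field, so the
  // placeholder in write(_, 1) would be a Char, and this fails to
  // compile against a (String, Int) signature:
  // line.split(",", -1)(0) foreach (write(_, 1))

  // Passing the expression directly hands write the String it expects:
  println(write(line.split(",", -1)(0), 1)) // fruit -> 1
}
```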
I'm guessing line is implicitly converted to a String here (thanks, HImplicits?). Then we have
line.split(",", -1)(0) foreach somethingOrOther
which splits the string into multiple fields (.split(...)), takes the zeroth of those fields ((0)), and then applies somethingOrOther to each of the characters of that field (foreach).
That's where your Char comes from.
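You can watch the Chars appear in a plain Scala script (hypothetical data):

```scala
object CharDemo extends App {
  val line = "apple,1,foo"
  val first = line.split(",", -1)(0) // "apple": a String

  // A String is a sequence of Chars, so foreach yields one Char at a time:
  first foreach { c => println(c) } // prints a, p, p, l, e on separate lines
}
```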