Hadoop map input with hexadecimal values
I have a list of tweets in HDFS as input and am trying to run a MapReduce job over them. This is my mapper implementation:
// tid and content are Text fields on the mapper class.
@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    try {
        // First tab-separated field is the tweet id; the rest is the content.
        String[] fields = value.toString().split("\t");
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < fields.length; i++) {
            if (i > 1) {
                sb.append("\t");
            }
            sb.append(fields[i]);
        }
        tid.set(fields[0]);
        content.set(sb.toString());
        context.write(tid, content);
    } catch (DecoderException e) { // left over from the Hex.decodeHex attempt described below
        e.printStackTrace();
    }
}
As you can see, I try to split the input on "\t", but the input (value.toString()) looks like this when I print it out:
2014\x091880284777\x09argento_un\x090\x090\x09RT @topmusic619: #RETWEET THIS!!!!!\x5CnFOLLOW ME & EVERYONE ELSE THAT RETWEETS THIS FOR 35+ FOLLOWERS\x5Cn#TeamFollowBack #Follow2BeFollowed #TajF\xE2\x80\xA6

Here's another example:

2014\x0934447260\x09RBEKP\x090\x090\x09\xE2\x80\x9C@LENEsipper: Wild lmfaooo RT @Yerrp08: L**o some n***a nutt up while gettin twerked
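In other words, each tab byte in the file seems to be the literal four-character sequence \x09 rather than a real tab character, which would explain why split("\t") never matches. As a quick sketch (just matching the literal text), splitting on the escape sequence itself does separate the fields:

String[] fields = value.toString().split("\\\\x09"); // regex for the literal text \x09, not a tab byte

That only fixes the field separation, though; the \x5Cn and \xE2\x80\xA6 style escapes inside the content would still be left as-is.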
I noticed that \x09 should be a tab character (ASCII 0x09), so I tried Apache Commons Codec's Hex:
String tmp = value.toString();
byte[] bytes = Hex.decodeHex(tmp.toCharArray());
But decodeHex returns null.
This is odd, because some of the characters are hex-encoded and others are not. How can I decode them?
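I suspect Hex.decodeHex can't be used directly here, because it expects the entire input to consist of pairs of hex digits, while these lines mix plain text with \xNN escapes. A minimal sketch of what I mean instead (HexUnescape and unescape are names I made up): find each literal \xNN, emit the byte it encodes, and decode the resulting byte stream as UTF-8:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexUnescape {

    // Matches a literal backslash, 'x', and two hex digits, e.g. \x09 or \xE2
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    public static String unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Matcher m = HEX_ESCAPE.matcher(s);
        int last = 0;
        while (m.find()) {
            // Copy the plain text before this escape as UTF-8 bytes.
            byte[] plain = s.substring(last, m.start()).getBytes(StandardCharsets.UTF_8);
            out.write(plain, 0, plain.length);
            // Append the single byte the escape encodes.
            out.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        byte[] tail = s.substring(last).getBytes(StandardCharsets.UTF_8);
        out.write(tail, 0, tail.length);
        // Decode the whole byte stream as UTF-8, so multi-byte sequences
        // such as \xE2\x80\xA6 come out as a single character.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}

With something like this the mapper could do String[] fields = HexUnescape.unescape(value.toString()).split("\t"); — but I'm not sure this is the right approach.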
Edit: also note that apart from the tabs, emojis are encoded as hex values too.
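For example, \xE2\x80\xA6 is the three-byte UTF-8 encoding of U+2026 (the "…" ellipsis), and emojis would show up the same way as four-byte \xF0\x9F... sequences. That is why decoding the escaped bytes and then reading them as UTF-8, as in the sketch above, should restore such characters:

import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // \xE2 \x80 \xA6 are the three UTF-8 bytes of U+2026
        byte[] bytes = { (byte) 0xE2, (byte) 0x80, (byte) 0xA6 };
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // prints the ellipsis character
    }
}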