Hadoop - what does global sort mean and when does it happen in MapReduce?
I am using the Hadoop streaming JAR for WordCount, and I want to know how to get globally sorted output. According to an answer to another question on SO, using only one reducer should give a global sort, but my result with numReduceTasks=1 (one reducer) is not sorted.
For example, my input to the mapper:
file 1: A long time ago in a galaxy far away.
file 2: Another Star Wars Episode
Result:
A 1
a 1
Star 1
back 1
for 1
far 2
away 1
time 1
Wars 1
long 1
Other 1
in 1
episode 1
galaxy 1
But this is not globally sorted!
So what do the "sort" in Shuffle and Sort and "global sort" actually mean?
mapper code:
#!/usr/bin/env python
import sys

# Emit one "word<TAB>1" line per word read from stdin.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
reducer code:
#!/usr/bin/env python
import sys

# Accumulate the total count per word in a dict.
word2count = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number.
        continue
    try:
        word2count[word] = word2count[word] + count
    except KeyError:
        word2count[word] = count

# Print the totals once all input has been read.
for word in word2count.keys():
    print('%s\t%s' % (word, word2count[word]))
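One likely explanation for the unsorted result, offered as a sketch rather than a confirmed diagnosis: Hadoop streaming delivers lines to the reducer already sorted by key, but the reducer above stores everything in a dict and prints word2count.keys(), whose iteration order in Python 2 is arbitrary, so the incoming order is lost. A reducer that sums consecutive lines with the same key and emits each total as soon as the key changes would preserve that order. The names parse and reduce_sorted are hypothetical, and the code assumes the input really is sorted by key.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        parts = line.rstrip('\n').split('\t', 1)
        if len(parts) != 2:
            continue
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue


def reduce_sorted(lines):
    """Sum counts of consecutive equal keys (assumes input sorted by key)."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    # Output order follows input order, so sorted input gives sorted output.
    for word, total in reduce_sorted(sys.stdin):
        print('%s\t%d' % (word, total))
```

Because nothing is buffered beyond the current key, this version also uses constant memory instead of holding every word in a dict.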
I am using this command to run it:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new_0 \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py \
-numReduceTasks=1
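To see what a globally sorted result from a single reducer should look like, the job can be imitated locally. This is only a rough sketch: sorted() stands in for Hadoop's shuffle-and-sort phase, and the function names map_words and simulate_job are hypothetical. Hadoop compares Text keys byte by byte, so uppercase ASCII words are expected to come before lowercase ones; Python's string ordering behaves the same way for this sample.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def simulate_job(lines):
    """Map, sort by key, then reduce, as a single-reducer job would."""
    # This sort is the stand-in for the framework's shuffle-and-sort phase.
    pairs = sorted(map_words(lines), key=lambda pair: pair[0])
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=lambda pair: pair[0])]


if __name__ == '__main__':
    sample = ["A long time ago in a galaxy far away.",
              "Another Star Wars Episode"]
    for word, total in simulate_job(sample):
        print('%s\t%d' % (word, total))
```

Comparing this local output with the contents of the job's output directory is a quick way to tell whether the ordering is lost in the framework or in the reducer script.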