Hadoop: what does a global sort mean, and when does it happen in MapReduce?

I am using the Hadoop streaming JAR for WordCount, and I want to know how to get globally sorted output. According to an answer to another question on SO, using only one reducer gives a global sort, but my result with numReduceTasks=1 (one reducer) is not sorted.

For example, my input to the mapper:

file 1: A long time ago in a galaxy far away.

file 2: Another Star Wars Episode

Result:

    A 1
    a 1
    Star 1
    back 1
    for 1
    far 2
    away 1
    time 1
    Wars 1
    long 1
    Other 1
    in 1
    episode 1
    galaxy 1

But this is not globally sorted!

So what does the sort in "Shuffle and Sort" mean, and what does "global sort" mean?

mapper code:

    #!/usr/bin/env python
    import sys

    # Emit one "word<TAB>1" line for every whitespace-separated token on stdin.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

      

reducer code:

    #!/usr/bin/env python
    import sys

    # Running total per word, keyed by the word itself.
    word2count = {}

    for line in sys.stdin:
        line = line.strip()
        # Skip lines that are not in "word<TAB>count" form.
        try:
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:
            continue
        word2count[word] = word2count.get(word, 0) + count

    # Print the totals in the dict's own iteration order.
    for word in word2count.keys():
        print('%s\t%s' % (word, word2count[word]))
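One possible explanation, offered as an assumption rather than a confirmed diagnosis: the reducer above collects every count in a dict and prints by iterating that dict, so whatever order the lines arrived in is not necessarily the order they are printed in. The sketch below emits each total as soon as a run of identical words ends, so the input order is carried through to the output. It assumes that streaming hands the reducer its lines already sorted by key; the names `parse` and `reduce_sorted_lines` and the sample lines are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.rstrip('\n').partition('\t')
        if not sep:
            continue  # no tab separator on this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not an integer


def reduce_sorted_lines(lines):
    """Sum counts per word, assuming equal words sit on adjacent lines."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Hypothetical sample standing in for the sorted lines a reducer would receive.
sample = ['A\t1\n', 'Star\t1\n', 'far\t1\n', 'far\t1\n', 'galaxy\t1\n']
for word, total in reduce_sorted_lines(sample):
    print('%s\t%d' % (word, total))
```

To use it as a streaming reducer, the loop at the bottom would iterate over `reduce_sorted_lines(sys.stdin)` instead of `sample` (with `import sys` added). Because nothing is stored per word, memory use also stays constant regardless of vocabulary size.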

      

I am using this command to run it:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/input \
    -output /user/cloudera/output_new_0 \
    -mapper /home/cloudera/wordcount_mapper.py \
    -reducer /home/cloudera/wordcount_reducer.py \
    -numReduceTasks=1
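To get a feel for what a globally sorted result from a single reducer might look like, the job can be imitated locally. This is only a rough sketch: it assumes keys are compared as raw UTF-8 bytes (which I believe is the default for text keys, so uppercase ASCII letters sort before lowercase ones), and the names `map_words` and `simulate_one_reducer` and the two sample lines are hypothetical, not my real input files.

```python
from itertools import groupby


def map_words(lines):
    """Mimic the mapper: one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def simulate_one_reducer(lines):
    """Map, sort all pairs by key, then sum each run of equal keys."""
    # Assumption: keys are ordered by their UTF-8 bytes, not case-insensitively.
    pairs = sorted(map_words(lines), key=lambda pair: pair[0].encode('utf-8'))
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=lambda pair: pair[0])]


# Hypothetical two-line input, one string per input file.
docs = ['b a B', 'a C']
for word, total in simulate_one_reducer(docs):
    print('%s\t%d' % (word, total))
```

With one reducer, every key passes through the same sorted stream, which is why a single output file in key order is what I would expect here.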

      
