MapReduce (Python) - How to sort the reducer output for a Top-N list?

I am new to MapReduce. Currently trying to complete the udacity course on Hadoop MapReduce.

I have a handler that parses the forum nodes and I get the tags associated with each node. My goal is to sort the first 10 tags.

Output example:

video   1
cs101   1
meta    1
bug     1
issues  1
nationalities   1
cs101   1
welcome 1
cs101   1
cs212   1
cs262   1
cs253   1
discussion      1
meta    1

      

It's pretty easy to add these to the reducer:

#!/usr/bin/python

import sys
import string

total = 0
oldKey = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")

    if(len(data_mapped) != 2):
        print "====================="
        print line.strip()
        print "====================="
        continue

    key, value = data_mapped

    if oldKey and oldKey != key:
        print total, "\t", oldKey
        oldKey = key;
        total = 0

    oldKey = key
    total += float(value)

if oldKey != None:
    print total, "\t", oldKey

      

Output:

1.0     application
1.0     board
1.0     browsers
1.0     bug
8.0     cs101
1.0     cs212
5.0     cs253
1.0     cs262
1.0     deadlines
1.0     digital
5.0     discussion
1.0     google-appengine
2.0     homework
1.0     html
1.0     hungarian
1.0     hw2-1
3.0     issues
2.0     jobs
2.0     lessons

      

I know the keys are sorted in the mapper output, so I just check if the keys are still the same tag. If not, I'll post the number of times the tag appears.

However, the problem is how to sort this list?

I can sort a list in python if I store all the information in a list or dictionary. However, this seems like a bad design decision, because if you have many different tags, you will lose memory.

+3


source to share


1 answer


I believe you can use the collections.Counter class here:

Example: (modified from your code)

#!/usr/bin/python

import sys
import collections

counter = collections.Counter()

for line in sys.stdin:
    k, v = line.strip().split("\t", 2)

    counter[k] += int(v)

print counter.most_common(10)

      



The class collections.Counter()

implements this particular use case and many other common use cases for counting things and collecting various statistics, etc.

8.3.1. Counter objects. The counter tool is designed to support handy and fast mascots. For example:

+3


source







All Articles