Posted to common-user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/04/02 23:38:07 UTC

Basic hadoop MR question

Hi,
I have a quick question. I am trying to write MapReduce (MR) code in Python,
following the word count example at:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

My first question is about the reducer. Why can't I declare a dictionary
(hashmap) in the reducer whose key is the word and whose value is a list of
counts (all 1's here)?

So something like:

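import sys
from collections import defaultdict

data_dict = defaultdict(list)
for line in sys.stdin:
    # each input line is "word<TAB>count"
    tokens = line.strip().split("\t")
    data_dict[tokens[0]].append(1)

for k, v in data_dict.items():
    print('%s\t%s' % (k, sum(v)))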

Also, in the reducer code from the link, why are the following lines needed?

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
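
For context, here is a minimal sketch of how I understand the streaming pattern
in that reducer. It is my own paraphrase, not the tutorial's exact code: the
function name reduce_sorted and the sample input are made up for illustration,
and it assumes the lines arrive sorted by word. As far as I can tell, inside the
loop a word is only emitted when the next, different word shows up, so the final
word is still pending when the input ends.

```python
def reduce_sorted(lines):
    """Yield (word, total) pairs from 'word<TAB>count' lines sorted by word.

    Hypothetical helper for illustration; not taken from the tutorial.
    """
    current_word = None
    current_count = 0
    for line in lines:
        word, _, count = line.strip().partition("\t")
        try:
            count = int(count)
        except ValueError:
            continue  # skip blank or malformed lines
        if word == current_word:
            # same key as the previous line: keep accumulating
            current_count += count
        else:
            # key changed: emit the finished word, then start the new one
            if current_word is not None:
                yield current_word, current_count
            current_word = word
            current_count = count
    # the last word never sees a "key changed" event, so emit it here
    if current_word is not None:
        yield current_word, current_count


# made-up sample input; in a real streaming job this would be sys.stdin
sample = ["bar\t1\n", "foo\t1\n", "foo\t1\n"]
result = list(reduce_sorted(sample))
for word, total in result:
    print("%s\t%s" % (word, total))
```

Unlike my dictionary version, this keeps only one word and one running count in
memory at a time.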

The code is well commented, but I still don't understand this part. :(
My apologies for asking naive questions.
Thanks