Posted to user@hadoop.apache.org by Demian Kurejwowski <de...@yahoo.com.INVALID> on 2017/09/22 17:40:44 UTC

hadoop questions for a beginner

hi, i am learning hadoop and currently doing the python map reduce tutorial. i am trying to understand the point of having separate map and reduce files.
i am assuming that when we launch the scripts, the mapper.py script goes to all the machines at the same time and they all start printing at the same time, and then the reducer reads the lines coming from those jobs in no particular order?
1. can i just write a script that gets the file, puts it in a temp file, and then works with it? (i guess this defeats the whole purpose of hadoop, right?)
2. when working with a map script, do i always need to print as key, value? or can i print whatever i want? and in what order does it come? if i read all the files of a folder like the tutorial says, are they read in sequential order by all the workers? can i make the mapper just print the lines of the file, and let the reducer do the logic of what i want to accomplish?



Writing An Hadoop MapReduce Program In Python - Michael G. Noll
<http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/>
By Michael G. Noll: how to write an Hadoop MapReduce program in Python with the Hadoop Streaming API

following this tutorial, i found the way of getting the information was making a directory like this.

the mapper.py:

import sys

# emit "word<TAB>1" for every word on every line of stdin
for line in sys.stdin:
    for word in line.strip().split():
        print word + "\t" + str(1)

the reducer.py:

import sys

# accumulate a count per word from the mapper's "word<TAB>count" lines
dic_words = {}
for line in sys.stdin:
    word, one_value = line.strip().split("\t")
    dic_words[word] = dic_words.get(word, 0) + int(one_value)

for key, value in dic_words.items():
    print key, str(value)

when i test it against a file it works, and testing it locally works too. something easy:

echo "bla ble bli bla" | python mapper.py | sort -k1,1 | python reducer.py

and i do get:

bla 2
ble 1
bli 1
(not sure why we need the sort, i guess that emulates how hadoop works? maybe the hadoop mappers run first and then they return a dictionary that the reducer can read?)
thanks guys, i know these are weird questions =(

Re: hadoop questions for a beginner

Posted by Gurmukh Singh <gu...@yahoo.com.INVALID>.
Well, in an actual job the input will be a file.

so, instead of:

echo "bla ble bli bla" | python mapper.py | sort -k1,1 | python reducer.py

you will have:

cat file.txt | python mapper.py | sort -k1,1 | python reducer.py

The file has to be on HDFS (keeping it simple; it can be on other 
filesystems), and then mapper.py is the map task logic, which will be 
executed against the data file "file.txt".

Depending upon the size of "file.txt" (the number of HDFS blocks it 
occupies), that many map tasks will run.
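
As a rough sketch of that arithmetic: there is one map task per input 
split, and by default one split per HDFS block. The 128 MB figure below 
assumes the default dfs.blocksize of recent Hadoop releases; your 
cluster may be configured differently.

import math

file_size = 1 * 1024 ** 3       # a hypothetical 1 GB file.txt
block_size = 128 * 1024 ** 2    # assumes default dfs.blocksize (128 MB)

# one map task per input split; by default one split per HDFS block
num_map_tasks = int(math.ceil(float(file_size) / block_size))
print num_map_tasks             # -> 8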

The output of all the map tasks is sorted and merged by key (the shuffle and sort phase) and then goes to the reducer for the final output. That is what the sort -k1,1 in your local pipeline emulates.
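
Because the shuffle hands the reducer its input sorted by key, all the 
lines for a given word arrive together. So a streaming reducer does not 
have to keep a dictionary of every word in memory; it can total one key 
at a time and emit it as soon as the key changes. A minimal sketch of 
such a reducer (not from the tutorial, just to illustrate why the 
sorted order matters):

import sys
from itertools import groupby

def parse(stdin):
    # turn "word<TAB>count" lines into (word, count) pairs
    for line in stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        yield word, int(count)

# groupby only works here because the input is sorted:
# equal keys are guaranteed to be adjacent
for word, group in groupby(parse(sys.stdin), key=lambda pair: pair[0]):
    total = sum(count for _, count in group)
    print word + "\t" + str(total)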

you can run the same program in hadoop as a streaming job:


$ hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py \
    -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py \
    -input /file.txt -output /output
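
Once the job finishes, the result lands in the /output directory as 
part files, one per reducer; you can inspect it with something like:

$ hdfs dfs -cat /output/part-00000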


The above is a very simple explanation; let me know if you have any 
further questions.



On 23/9/17 3:40 am, Demian Kurejwowski wrote:
> [snip]