Posted to common-user@hadoop.apache.org by Foss User <fo...@gmail.com> on 2009/04/04 08:39:34 UTC
Question on distribution of classes and jobs
If I have written a WordCount.java job in this manner:
conf.setMapperClass(Map.class);
conf.setCombinerClass(Combine.class);
conf.setReducerClass(Reduce.class);
So, you can see that three classes are being used here. I have
packaged these classes into a jar file called wc.jar and I run it like
this:
$ bin/hadoop jar wc.jar WordCountJob
1) I want to know: when the job runs on a 5-machine cluster, is the
whole JAR file distributed across the 5 machines, or are the
individual class files distributed separately?
2) Also, let us say the number of reducers is 2 while the number of
mappers is 5. What happens in this case? How are the class files or
jar files distributed?
3) Are they distributed via RPC or HTTP?
Re: Question on distribution of classes and jobs
Posted by Aaron Kimball <aa...@cloudera.com>.
On Fri, Apr 3, 2009 at 11:39 PM, Foss User <fo...@gmail.com> wrote:
> If I have written a WordCount.java job in this manner:
>
> conf.setMapperClass(Map.class);
> conf.setCombinerClass(Combine.class);
> conf.setReducerClass(Reduce.class);
>
> So, you can see that three classes are being used here. I have
> packaged these classes into a jar file called wc.jar and I run it like
> this:
>
> $ bin/hadoop jar wc.jar WordCountJob
>
> 1) I want to know: when the job runs on a 5-machine cluster, is the
> whole JAR file distributed across the 5 machines, or are the
> individual class files distributed separately?
The whole jar.
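Since the whole jar is what ships, it can be worth confirming that all three classes actually made it into wc.jar before submitting (the usual quick check is `jar tf wc.jar`). Below is a minimal stdlib sketch of the same check; the class and file names are taken from the question, and the temp jar is built on the fly only so the sketch runs anywhere:

```java
import java.io.*;
import java.util.*;
import java.util.jar.*;

// Sketch: list a jar's entries with java.util.jar, mimicking `jar tf wc.jar`,
// to confirm the Map, Combine, and Reduce classes were packaged.
public class JarContents {
    static Set<String> listEntries(File jar) throws IOException {
        Set<String> names = new TreeSet<>();
        try (JarFile jf = new JarFile(jar)) {
            for (Enumeration<JarEntry> e = jf.entries(); e.hasMoreElements(); )
                names.add(e.nextElement().getName());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a stand-in wc.jar so the sketch is self-contained.
        File jar = File.createTempFile("wc", ".jar");
        jar.deleteOnExit();
        String[] classes = { "WordCountJob.class", "WordCountJob$Map.class",
                             "WordCountJob$Combine.class", "WordCountJob$Reduce.class" };
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
            for (String name : classes) {
                out.putNextEntry(new JarEntry(name));
                out.closeEntry();
            }
        }
        // Print every entry, just as `jar tf` would.
        for (String name : listEntries(jar))
            System.out.println(name);
    }
}
```

If any of the inner classes is missing from the listing, the tasks on the cluster will fail with a ClassNotFoundException, since the jar is the only thing the worker nodes receive.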
>
>
> 2) Also, let us say the number of reducers is 2 while the number of
> mappers is 5. What happens in this case? How are the class files or
> jar files distributed?
The jar is uploaded into HDFS, specifically into a subdirectory of wherever
you configured mapred.system.dir. The mapper and reducer counts don't change
this: every TaskTracker that runs one of your tasks fetches the same jar
from there.
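For reference, mapred.system.dir is normally set in mapred-site.xml; the path in this fragment is only illustrative, not a recommended value:

```xml
<!-- mapred-site.xml: HDFS directory where the framework stages shared
     job files, including the submitted job jar (path is illustrative) -->
<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
</property>
```

If it is left unset, Hadoop derives a default location under hadoop.tmp.dir.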
>
>
> 3) Are they distributed via RPC or HTTP?
The client uses the HDFS protocol to upload its jar file into HDFS. Then all
the TaskTrackers retrieve it with the same protocol.