Posted to common-user@hadoop.apache.org by Foss User <fo...@gmail.com> on 2009/04/04 08:39:34 UTC
Question on distribution of classes and jobs
If I have written a WordCount.java job in this manner:
conf.setMapperClass(Map.class);
conf.setCombinerClass(Combine.class);
conf.setReducerClass(Reduce.class);
So, you can see that three classes are being used here. I have
packaged these classes into a jar file called wc.jar and I run it like
this:
$ bin/hadoop jar wc.jar WordCountJob
1) I want to know: when the job runs on a 5-machine cluster, is the
whole JAR file distributed across the 5 machines, or are the
individual class files distributed separately?
2) Also, let us say the number of reducers is 2 while the number of
mappers is 5. What happens in this case? How are the class files or
jar files distributed?
3) Are they distributed via RPC or HTTP?
Re: Question on distribution of classes and jobs
Posted by Aaron Kimball <aa...@cloudera.com>.
On Fri, Apr 3, 2009 at 11:39 PM, Foss User <fo...@gmail.com> wrote:
> If I have written a WordCount.java job in this manner:
>
> conf.setMapperClass(Map.class);
> conf.setCombinerClass(Combine.class);
> conf.setReducerClass(Reduce.class);
>
> So, you can see that three classes are being used here. I have
> packaged these classes into a jar file called wc.jar and I run it like
> this:
>
> $ bin/hadoop jar wc.jar WordCountJob
>
> 1) I want to know: when the job runs on a 5-machine cluster, is the
> whole JAR file distributed across the 5 machines, or are the
> individual class files distributed separately?
The whole jar.
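Since the whole jar is what ships, it can be worth confirming that all three classes actually made it into wc.jar before submitting (the usual quick check is `jar tf wc.jar`). Below is a minimal stdlib sketch of the same check; the class and file names are taken from the question, and the temp jar is built on the fly only so the sketch runs anywhere:

```java
import java.io.*;
import java.util.*;
import java.util.jar.*;

// Sketch: list a jar's entries with java.util.jar, mimicking `jar tf wc.jar`,
// to confirm the Map, Combine, and Reduce classes were packaged.
public class JarContents {
    static Set<String> listEntries(File jar) throws IOException {
        Set<String> names = new TreeSet<>();
        try (JarFile jf = new JarFile(jar)) {
            for (Enumeration<JarEntry> e = jf.entries(); e.hasMoreElements(); )
                names.add(e.nextElement().getName());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a stand-in wc.jar so the sketch is self-contained.
        File jar = File.createTempFile("wc", ".jar");
        jar.deleteOnExit();
        String[] classes = { "WordCountJob.class", "WordCountJob$Map.class",
                             "WordCountJob$Combine.class", "WordCountJob$Reduce.class" };
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
            for (String name : classes) {
                out.putNextEntry(new JarEntry(name));
                out.closeEntry();
            }
        }
        // Print every entry, just as `jar tf` would.
        for (String name : listEntries(jar))
            System.out.println(name);
    }
}
```

If any of the inner classes is missing from the listing, the tasks on the cluster will fail with a ClassNotFoundException, since the jar is the only thing the worker nodes receive.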
>
>
> 2) Also, let us say the number of reducers is 2 while the number of
> mappers is 5. What happens in this case? How are the class files or
> jar files distributed?
The jar is uploaded into HDFS, specifically into a subdirectory of wherever
you configured mapred.system.dir. The mapper and reducer counts don't change
this: every TaskTracker that runs one of your tasks fetches the same jar
from there.
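For reference, mapred.system.dir is normally set in mapred-site.xml; the path in this fragment is only illustrative, not a recommended value:

```xml
<!-- mapred-site.xml: HDFS directory where the framework stages shared
     job files, including the submitted job jar (path is illustrative) -->
<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
</property>
```

If it is left unset, Hadoop derives a default location under hadoop.tmp.dir.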
>
>
> 3) Are they distributed via RPC or HTTP?
The client uses the HDFS protocol to upload its jar file into HDFS. Then all
the TaskTrackers retrieve it with the same protocol.