Posted to common-user@hadoop.apache.org by "R. James Firby" <fi...@powerset.com> on 2007/03/09 01:29:56 UTC
Change in JobClient behavior may not be ideal
We finally upgraded our Hadoop install from 0.9.2 to 0.12.0. It went pretty
smoothly. Kudos to all. However, one change in the behavior of the JobClient
seems like a problem.
In our install we have several different clusters running at different
locations with different sizes and characteristics. We have been submitting
jobs to these clusters from a separate, central point using JobClient.
In the past all we had to do was point JobClient at the right JobTracker and
submit the job. The -jt flag to JobClient makes this simple.
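As a concrete illustration, a submission from the central point looks roughly like this (the job jar, class name, tracker address, and paths are hypothetical placeholders, not from the original report):

```shell
bin/hadoop jar our-job.jar com.example.OurJob \
    -jt jobtracker-west.example.com:50020 \
    /input/path /output/path
```

Pointing -jt at a different JobTracker was all it took to target a different cluster.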
However, the JobClient now computes the task splits at the central point
rather than at the JobTracker. That step involves looking up the default
number of mapred tasks in the cluster configuration (i.e., mapred.map.tasks).
Unfortunately, the cluster configuration isn't available where we run the
JobClient; it is only available on the cluster. In the past this didn't
matter because all the JobClient really needed from the configuration was
communication information.
For things to work right, we need to maintain a separate configuration for
every cluster at the central point and at every other place where we might
want to use JobClient. It was much simpler when we could use a single
central config to submit jobs to all clusters.
It would be good to avoid requiring cluster-specific configuration
parameters just to submit a job through JobClient.
In addition, doing the splits in the JobClient lets a locally set
mapred.map.tasks value override the value set in hadoop-site.xml on the
cluster, which seems like a bug.
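For reference, the cluster-side value in question would be set in hadoop-site.xml on the cluster, roughly as below (the value 79 is just an illustrative placeholder):

```xml
<property>
  <name>mapred.map.tasks</name>
  <value>79</value>
  <description>Default number of map tasks per job on this cluster.</description>
</property>
```

With splits computed in the JobClient, a different value in the submitter's local config silently wins over this one.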
Jim Firby
Powerset Inc.
Re: Change in JobClient behavior may not be ideal
Posted by Owen O'Malley <ow...@yahoo-inc.com>.
On Mar 8, 2007, at 4:29 PM, R. James Firby wrote:
> However, the JobClient now computes the task splits at the central point
> rather than at the JobTracker. That step involves looking up the default
> number of mapred tasks in the cluster configuration (i.e., mapred.map.tasks).
> Unfortunately, the cluster configuration isn't available where we run the
> JobClient; it is only available on the cluster. In the past this didn't
> matter because all the JobClient really needed from the configuration was
> communication information.
The computation of the splits was moved from the job tracker to the
client, both to offload the job tracker and, more importantly, to remove
the need to load user code in the job tracker.
I agree that since the cluster size and composition are defined by
the cluster, it would make sense to pass the capacity of the cluster
back via the JobSubmissionProtocol, just as the name of the default
file system is. (I created HADOOP-1100.) I would pull the default
values for mapred.{map,reduce}.tasks out of hadoop-default.xml and
have JobConf return a number based on the cluster capacity if the
user hasn't given a specific value.
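The fallback Owen describes could be sketched like this; the class name, the capacity figure, and the two-waves-of-maps heuristic are illustrative assumptions, not the actual JobConf code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed behavior: an explicitly configured
// mapred.map.tasks wins; otherwise a default is derived from the
// cluster capacity reported back through JobSubmissionProtocol.
public class MapTaskHint {
    private final Map<String, String> conf = new HashMap<>();

    public void set(String key, String value) {
        conf.put(key, value);
    }

    // clusterMapCapacity would come from the JobTracker at submit time.
    public int getNumMapTasks(int clusterMapCapacity) {
        String explicit = conf.get("mapred.map.tasks");
        if (explicit != null) {
            return Integer.parseInt(explicit);
        }
        // Hypothetical heuristic: aim for two waves of maps.
        return 2 * clusterMapCapacity;
    }

    public static void main(String[] args) {
        MapTaskHint hint = new MapTaskHint();
        System.out.println(hint.getNumMapTasks(40)); // no explicit value set
        hint.set("mapred.map.tasks", "100");
        System.out.println(hint.getNumMapTasks(40)); // explicit value wins
    }
}
```

The point is that the submitter no longer needs the cluster's config file at all: the only cluster-specific input is a number the JobTracker can report over the wire.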
> In addition, doing the splits in the JobClient lets a locally set
> mapred.map.tasks value override the value set in hadoop-site.xml on the
> cluster, which seems like a bug.
Once the input splits are generated, the number of splits defines the
number of maps. In my opinion, it is far less confusing to users to
have conf.getNumMapTasks() return the real number of maps rather than
the original hint.
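That distinction between hint and reality can be illustrated with the split arithmetic itself; the splitter below is a simplified stand-in for Hadoop's file input splitting, not the actual code:

```java
// Simplified stand-in for input splitting: once the input is cut
// into splits, the split count *is* the number of map tasks, and
// the numSplits hint only influences the target split size.
public class SplitCount {
    // numSplits is the user's hint; block boundaries still dominate.
    public static int computeSplits(long totalBytes, long blockSize, int numSplits) {
        long goalSize = totalBytes / Math.max(1, numSplits);
        long splitSize = Math.max(1, Math.min(goalSize, blockSize));
        return (int) ((totalBytes + splitSize - 1) / splitSize); // ceiling division
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // 10 GB of input, 64 MB blocks, user hinted 4 maps:
        // the 2.5 GB goal size is capped at the 64 MB block size,
        // so the real map count is 160, not 4.
        System.out.println(computeSplits(10 * gb, 64L * 1024 * 1024, 4));
    }
}
```

Under this model, returning the hint from getNumMapTasks() after splitting would report 4 when 160 map tasks actually run, which is exactly the confusion Owen wants to avoid.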
-- Owen