Posted to common-user@hadoop.apache.org by Rares Vernica <rv...@gmail.com> on 2009/08/27 03:52:45 UTC

control map to split assignment

Hello,

I wonder if there is a way to control how maps are assigned to splits
in order to balance the load across the cluster.

Here is a simplified example. I have two types of inputs: "long" and
"short". Each input is in a different file and will be processed by a
single map task. Suppose the "long" inputs take 10s to process while
the "short" inputs take 3s to process. I have two "long" inputs and
two "short" inputs. My cluster has 2 nodes and each node can execute
only one map task at a time. A possible schedule of the tasks could be
the following:

Node 1: "long map", "short map" -> 10s + 3s = 13s
Node 2: "long map", "short map" -> 10s + 3s = 13s

So, my job will be done in 13s. Another possible schedule is:

Node 1: "long map" -> 10s
Node 2: "short map", "short map", "long map" -> 3s + 3s + 10s = 16s

And, my job will be done in 16s. Clearly, the first schedule is better.

Is there a way to control how the schedule is built? If I can control
which inputs are processed first, I could schedule the "long" inputs
to be processed first and so they will be balanced across nodes and I
will end up with something similar to the first schedule.
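
To make the effect of ordering concrete, here is a minimal sketch that
simulates the example above. It is a toy model, not Hadoop's actual
scheduler: it assumes tasks are handed out in a fixed order and that
each node picks up the next task as soon as it is free. The function
name makespan and the constants LONG and SHORT are made up for this
illustration.

```python
import heapq

def makespan(durations, nodes=2):
    """Return the job completion time when tasks are handed out in the
    given order, each one going to the node that becomes free first.
    Toy model only; this is not how Hadoop's scheduler is implemented."""
    free_at = [0] * nodes          # time at which each node becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest idle node
        heapq.heappush(free_at, start + d)  # node is busy until start + d
    return max(free_at)

LONG, SHORT = 10, 3  # processing times in seconds, from the example

inputs = [LONG, SHORT, SHORT, LONG]
unsorted_time = makespan(inputs)                      # order as given
sorted_time = makespan(sorted(inputs, reverse=True))  # "long" inputs first
print(unsorted_time, sorted_time)
```

With the order long, short, short, long the model reproduces the 16s
schedule, and with the "long" inputs first it reproduces the 13s one.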

I could configure the job so that a "long" input gets processed by
more than one map, and so end up balancing the work, but I noticed
that overall, this takes more time than a bad schedule with only one
map per input.

Thanks!

Cheers,
Rares Vernica

Re: control map to split assignment

Posted by Alex Loddengaard <al...@cloudera.com>.
Hi Rares,

Unfortunately there isn't a way to control the scheduling of individual
tasks, at least as far as I know.  Might you be able to split this up into
two jobs: one for the "short" inputs; another for the "long" inputs?  Just a
thought.

Alex
