Posted to mapreduce-user@hadoop.apache.org by Praveen Sripati <pr...@gmail.com> on 2011/09/22 15:05:47 UTC

Regarding FIFO scheduler

Hi,

Let's assume there are two jobs, J1 (100 map tasks) and J2 (200 map tasks),
the cluster has a capacity of 150 map slots (15 nodes with 10 map slots per
node), and Hadoop is using the default FIFO scheduler. If I submit J1 first
and then J2, will the jobs run in parallel, or does J1 have to complete
before J2 starts?

I was reading 'Hadoop - The Definitive Guide', which says: "Early versions
of Hadoop had a very simple approach to scheduling users’ jobs: they ran in
order of submission, using a FIFO scheduler. Typically, each job would use
the whole cluster, so jobs had to wait their turn."

Thanks,
Praveen
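Working through the arithmetic in the question, here is a toy sketch of how a FIFO scheduler would hand out the first wave of map slots (plain Python, illustrative only, not Hadoop's actual scheduler code): J1, submitted first, gets all 100 slots it asks for, and J2 runs on the remaining 50.

```python
# Toy model of FIFO map-slot assignment; illustrative only, not Hadoop code.
def fifo_first_wave(cluster_slots, jobs):
    """jobs: list of (name, pending_map_tasks) in submission order.
    Returns the map slots granted to each job in the first scheduling wave."""
    granted = {}
    free = cluster_slots
    for name, pending in jobs:
        take = min(pending, free)   # FIFO: earlier submissions are served first
        granted[name] = take
        free -= take
    return granted

# 15 nodes x 10 map slots = 150 slots; J1 has 100 maps, J2 has 200.
wave = fifo_first_wave(150, [("J1", 100), ("J2", 200)])
# J1 gets all 100 of its maps scheduled; J2 runs 50 in parallel.
```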

Re: Regarding FIFO scheduler

Posted by Praveen Sripati <pr...@gmail.com>.
Thanks, got the point. So the shuffle and sort can happen in parallel even
before all the map tasks have finished, but the reduce itself runs only
after all the map tasks are complete.

Praveen
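That distinction can be sketched numerically. Assuming, as a simplification, that a reducer can fetch a map's output as soon as that map finishes, the shuffle overlaps the map phase, while the reduce() calls cannot begin until the last map is done (the finish times below are made up for illustration):

```python
# Toy timeline: maps finish at different times; the shuffle of each map's
# output can start as soon as that map finishes, but reduce() waits for all.
map_finish_times = [3, 5, 8, 12]   # hypothetical map finish times (minutes)

earliest_shuffle_start = min(map_finish_times)  # 3: overlaps the other maps
earliest_reduce_start = max(map_finish_times)   # 12: only after the last map
```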


Re: Regarding FIFO scheduler

Posted by Joey Echeverria <jo...@cloudera.com>.
In most cases, your job will have more map tasks than map slots. You
want the reducers to spin up at some point before all your maps
complete, so that the shuffle and sort can work in parallel with some
of your map tasks. I usually set slow start to 80%, sometimes higher
if I know the maps are slow and they do a lot of filtering, so there
isn't too much intermediate data.

-Joey
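Joey's 80% suggestion translates into a simple threshold: with mapreduce.job.reduce.slowstart.completedmaps set to a fraction f, reducers become eligible to launch once f × (total maps) have completed. A small sketch (plain Python; the property name and its 0.05 default are real, the helper function is illustrative):

```python
import math

def reducers_eligible(completed_maps, total_maps, slowstart=0.05):
    """True once the completed-map fraction reaches the slowstart threshold.
    0.05 is the shipped default for
    mapreduce.job.reduce.slowstart.completedmaps."""
    return completed_maps >= math.ceil(slowstart * total_maps)

# With 200 maps: the default launches reducers after only 10 maps finish,
# while Joey's 0.80 waits for 160.
```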


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Regarding FIFO scheduler

Posted by Praveen Sripati <pr...@gmail.com>.
Joey,

Thanks for the response.

'mapreduce.job.reduce.slowstart.completedmaps' is default set to 0.05 and
says 'Fraction of the number of maps in the job which should be complete
before reduces are scheduled for the job.'

Shouldn't the map tasks be completed before the reduce tasks are kicked off
for a particular job?

Praveen


Re: Regarding FIFO scheduler

Posted by Joey Echeverria <jo...@cloudera.com>.
The jobs would run in parallel since J1 doesn't use all of your map
slots. Things get more interesting with reduce slots. If J1 is an
overall slower job, and you haven't configured
mapred.reduce.slowstart.completed.maps, then J1 could launch a bunch
of idle reduce tasks which would starve J2.

In general, it's best to configure the slow start property and to use
the fair scheduler or capacity scheduler.

-Joey
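The starvation scenario can be sketched too: under FIFO, reduce slots granted to J1 stay occupied even while its reducers sit idle waiting for map output, so nothing is left for J2. A toy model (illustrative only; the slot and reducer counts are made up):

```python
# Toy model of reduce-slot starvation under FIFO; not Hadoop code.
def free_reduce_slots(total_slots, j1_reducers_launched):
    """Reduce slots left for J2 after FIFO grants J1's (possibly idle)
    reducers; idle reducers still hold their slots."""
    return max(0, total_slots - j1_reducers_launched)

# Say the cluster has 30 reduce slots and J1 asks for 30 reducers.
# With slowstart near the 0.05 default, J1 launches them almost immediately;
# they sit idle waiting for map output, and J2 gets no reduce slots at all.
```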




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434