
Posted to common-user@hadoop.apache.org by Mithila Nagendra <mn...@asu.edu> on 2009/08/13 19:44:41 UTC

Intermediary Data on Fair Scheduler

Hello All

When the fair scheduler switches between two jobs, what does it do with the
intermediary data? Does it dump the data/job states onto the disk (DFS)? Or
does it do a context switch (i.e. everything is in memory)? I was looking at
the scheduler for an application I'm working on; any pointers would be
appreciated!

Thanks!
Mithila Nagendra
Arizona State University

Re: Intermediary Data on Fair Scheduler

Posted by Mithila Nagendra <mn...@asu.edu>.

This helps a lot! Thank you Todd.

Best Regards
Mithila

On Thu, Aug 13, 2009 at 11:40 AM, Todd Lipcon <to...@cloudera.com> wrote:

> On Thu, Aug 13, 2009 at 11:32 AM, Mithila Nagendra <mn...@asu.edu>
> wrote:
>
> > Hi Todd
> >
> > So when two jobs are assigned to a pool, where one job has 1 map task and
> > 1 reduce task and the other has 5 map and 5 reduce tasks, how will the
> > switch between these jobs take place?
>
>
> The switching happens on the task level - after one of the map tasks from
> the big job has finished, the small job will get its map task executed
> before the rest of the other job's.
>
>
> >
> >
> > Let's say the scheduler starts with the bigger job and runs 1 map task.
> > When it switches to the shorter job, what does it do with the intermediate
> > data? For instance, in Hadoop On Demand, if we run a search query, where
> > would the search keywords be stored? I assume that if the bigger job is in
> > the middle of a map task, the smaller job will wait for that task to end
> > before the map task for the shorter job is launched.
> >
>
> Intermediate data from the big job will be on the local disk like it always
> is - this isn't anything special about the fair scheduler. Map outputs
> remain in mapred.local.dir until the job is complete.
>
> -Todd
>
>
> On Thu, Aug 13, 2009 at 10:52 AM, Todd Lipcon <to...@cloudera.com> wrote:
>
> > Hi Mithila,
> >
> > I assume you're referring to fair scheduler preemption. In the preemption
> > scenario, tasks are completely killed, not paused. It's not like a
> > preemptive scheduler in your OS where things are "context switched". This
> > is why preemption is not enabled by default and has tuning parameters that
> > only trigger preemption in certain situations.
> >
> > Hope that helps,
> > -Todd
> >
> > On Thu, Aug 13, 2009 at 10:44 AM, Mithila Nagendra <mn...@asu.edu>
> > wrote:
> >
> > > Hello All
> > >
> > > When the fair scheduler switches between two jobs, what does it do with
> > > the intermediary data? Does it dump the data/job states onto the disk
> > > (DFS)? Or does it do a context switch (i.e. everything is in memory)? I
> > > was looking at the scheduler for an application I'm working on; any
> > > pointers would be appreciated!
> > >
> > > Thanks!
> > > Mithila Nagendra
> > > Arizona State University
> > >
> >
>

Re: Intermediary Data on Fair Scheduler

Posted by Todd Lipcon <to...@cloudera.com>.

On Thu, Aug 13, 2009 at 11:32 AM, Mithila Nagendra <mn...@asu.edu> wrote:

> Hi Todd
>
> So when two jobs are assigned to a pool, where one job has 1 map task and
> 1 reduce task and the other has 5 map and 5 reduce tasks, how will the
> switch between these jobs take place?


The switching happens on the task level - after one of the map tasks from
the big job has finished, the small job will get its map task executed
before the rest of the other job's.
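
As a rough illustration of the task-level switching described above, here is a
toy Python sketch with a single map slot. It is not Hadoop code: the function
name `schedule_single_slot` is hypothetical, and "give the slot to the job that
has launched the fewest tasks so far" is an assumed stand-in for the fair
scheduler's real fairness calculation.

```python
def schedule_single_slot(pending, first_job):
    """Return the order in which map tasks run on one slot (toy model).

    pending:   dict mapping job name -> number of map tasks still to run.
    first_job: the job that happens to get the slot first.
    """
    pending = dict(pending)               # do not mutate the caller's dict
    launched = {job: 0 for job in pending}
    order = []
    job = first_job
    while True:
        # A running task is never interrupted here; it simply runs to the end.
        order.append(f"{job}-map{launched[job]}")
        launched[job] += 1
        pending[job] -= 1

        # The slot is free again: pick among jobs that still have work.
        candidates = [j for j in pending if pending[j] > 0]
        if not candidates:
            return order
        # Assumed fairness rule: favour the job that has launched the least.
        job = min(candidates, key=lambda j: launched[j])


order = schedule_single_slot({"big": 5, "small": 1}, first_job="big")
print(order)
```

In this sketch the small job's only map task runs right after the big job's
first map task finishes, and the big job's remaining four tasks follow.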


>
>
> Let's say the scheduler starts with the bigger job and runs 1 map task. When
> it switches to the shorter job, what does it do with the intermediate data?
> For instance, in Hadoop On Demand, if we run a search query, where would the
> search keywords be stored? I assume that if the bigger job is in the middle
> of a map task, the smaller job will wait for that task to end before the map
> task for the shorter job is launched.
>

Intermediate data from the big job will be on the local disk like it always
is - this isn't anything special about the fair scheduler. Map outputs
remain in mapred.local.dir until the job is complete.
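
To see which local directories that refers to on a given node, one option is to
read the `mapred.local.dir` property out of the MapReduce site configuration.
The sketch below parses a sample configuration string with Python's standard
library; the helper name `local_dirs` and both directory paths are made up for
illustration, and a real cluster will have its own values.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration contents; the two paths are invented examples.
SAMPLE_CONF = """<configuration>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred/local,/data/2/mapred/local</value>
  </property>
</configuration>
"""


def local_dirs(conf_xml):
    """Return the directories listed under mapred.local.dir, or [] if unset."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "mapred.local.dir":
            value = prop.findtext("value", "")
            # The property holds a comma-separated list of directories.
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


dirs = local_dirs(SAMPLE_CONF)
print(dirs)
```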

-Todd


On Thu, Aug 13, 2009 at 10:52 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Mithila,
>
> I assume you're referring to fair scheduler preemption. In the preemption
> scenario, tasks are completely killed, not paused. It's not like a
> preemptive scheduler in your OS where things are "context switched". This is
> why preemption is not enabled by default and has tuning parameters that
> only trigger preemption in certain situations.
>
> Hope that helps,
> -Todd
>
> On Thu, Aug 13, 2009 at 10:44 AM, Mithila Nagendra <mn...@asu.edu>
> wrote:
>
> > Hello All
> >
> > When the fair scheduler switches between two jobs, what does it do with
> > the intermediary data? Does it dump the data/job states onto the disk
> > (DFS)? Or does it do a context switch (i.e. everything is in memory)? I was
> > looking at the scheduler for an application I'm working on; any pointers
> > would be appreciated!
> >
> > Thanks!
> > Mithila Nagendra
> > Arizona State University
> >
>

Re: Intermediary Data on Fair Scheduler

Posted by Mithila Nagendra <mn...@asu.edu>.

Hi Todd

So when two jobs are assigned to a pool, where one job has 1 map task and
1 reduce task and the other has 5 map and 5 reduce tasks, how will the
switch between these jobs take place?

Let's say the scheduler starts with the bigger job and runs 1 map task. When it
switches to the shorter job, what does it do with the intermediate data? For
instance, in Hadoop On Demand, if we run a search query, where would the search
keywords be stored? I assume that if the bigger job is in the middle of a map
task, the smaller job will wait for that task to end before the map task for
the shorter job is launched.

Thanks!
Mithila

On Thu, Aug 13, 2009 at 10:52 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Mithila,
>
> I assume you're referring to fair scheduler preemption. In the preemption
> scenario, tasks are completely killed, not paused. It's not like a
> preemptive scheduler in your OS where things are "context switched". This is
> why preemption is not enabled by default and has tuning parameters that
> only trigger preemption in certain situations.
>
> Hope that helps,
> -Todd
>
> On Thu, Aug 13, 2009 at 10:44 AM, Mithila Nagendra <mn...@asu.edu>
> wrote:
>
> > Hello All
> >
> > When the fair scheduler switches between two jobs, what does it do with
> > the intermediary data? Does it dump the data/job states onto the disk
> > (DFS)? Or does it do a context switch (i.e. everything is in memory)? I was
> > looking at the scheduler for an application I'm working on; any pointers
> > would be appreciated!
> >
> > Thanks!
> > Mithila Nagendra
> > Arizona State University
> >
>

Re: Intermediary Data on Fair Scheduler

Posted by Todd Lipcon <to...@cloudera.com>.

Hi Mithila,

I assume you're referring to fair scheduler preemption. In the preemption
scenario, tasks are completely killed, not paused. It's not like a
preemptive scheduler in your OS where things are "context switched". This is
why preemption is not enabled by default and has tuning parameters that
only trigger preemption in certain situations.
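
The difference between killing and pausing can be sketched with a toy Python
model. This is only an illustration, not Hadoop's implementation: the class
`ToyTask`, its "work units", and the task name are all hypothetical.

```python
class ToyTask:
    """Toy stand-in for a map task attempt; progress is in abstract units."""

    def __init__(self, name, total_units):
        self.name = name
        self.total_units = total_units
        self.done_units = 0
        self.attempt = 0        # how many times the task has been restarted
        self.wasted_units = 0   # work thrown away by kills

    def work(self, units):
        """Advance the current attempt, capped at the task's total work."""
        self.done_units = min(self.total_units, self.done_units + units)

    def kill(self):
        """Model preemption as a kill: progress is discarded, not saved."""
        self.wasted_units += self.done_units
        self.done_units = 0
        self.attempt += 1

    @property
    def finished(self):
        return self.done_units == self.total_units


task = ToyTask("big-map3", total_units=10)
task.work(7)     # 70% done when the scheduler preempts it
task.kill()      # nothing is checkpointed; the next attempt starts from zero
task.work(10)    # the rerun has to redo all 10 units
print(task.attempt, task.wasted_units, task.finished)
```

In this model the 7 units done before the kill are simply lost, which is the
cost that makes preemption something to trigger only in certain situations.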

Hope that helps,
-Todd

On Thu, Aug 13, 2009 at 10:44 AM, Mithila Nagendra <mn...@asu.edu> wrote:

> Hello All
>
> When the fair scheduler switches between two jobs, what does it do with the
> intermediary data? Does it dump the data/job states onto the disk (DFS)? Or
> does it do a context switch (i.e. everything is in memory)? I was looking
> at the scheduler for an application I'm working on; any pointers would be
> appreciated!
>
> Thanks!
> Mithila Nagendra
> Arizona State University
>