Posted to user@hadoop.apache.org by John Cuffney <cu...@gmail.com> on 2012/09/07 09:54:38 UTC

Job Controller for MapReduce task assignment

Hey,

Which class handles the top-level partitioning for MapReduce? I may be
misunderstanding how this is handled, but as I picture it, there is a
top-level controller that kicks off the whole process: it partitions the
input and distributes the input segments to the various machines/tasks.
I have been searching through many of the Job classes, and they all seem
to handle a single task, whereas I need to do some work at the
highest-level controller, if one exists. Any pointers on what I'm looking
for, or on whether I'm on the wrong track, would be much appreciated.

Thanks for the help,
John

Re: Job Controller for MapReduce task assignment

Posted by Harsh J <ha...@cloudera.com>.
Hey John,

Here's how MR works, put simply:

- Job.submit() is called.
- The Job's InputFormat#getSplits() is called; its result is serialized
and shipped, along with other job artifacts such as JARs, to the
configured FS for the JobTracker (or the MR2 ApplicationMaster) to use
(see the sketch after this list).
- The split info carries locality hints, which the scheduler then uses
when assigning tasks to a host's slots or resources, depending also on
availability and the requested resources (hence a 'hint', not a strict
guarantee).
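
As a rough sketch of where that client-side split computation can be
hooked: the class below is just a placeholder name, and it only wraps the
stock TextInputFormat to show that the "top level partitioning" runs in
the client JVM during Job.submit(), not on the cluster.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Placeholder InputFormat: intercepts the split list computed by the
// stock TextInputFormat before it is serialized and shipped with the job.
public class LoggingTextInputFormat extends TextInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = super.getSplits(context);
    // Runs on the client, at submission time.
    System.err.println("Computed " + splits.size() + " input splits");
    return splits;
  }
}

You'd point the job at it with
job.setInputFormatClass(LoggingTextInputFormat.class).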

The first two steps are client-side (and controllable); the last depends
on the scheduler you have put in use (FIFO/Capacity/Fair) or have
implemented yourself (custom).
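
For example, on MR1 the JobTracker picks its scheduler from
mapred-site.xml; the value below is what you'd set for the Fair Scheduler
(assuming its jar is on the JobTracker classpath). On YARN the equivalent
knob is yarn.resourcemanager.scheduler.class.

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>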

I'm not sure exactly what you're asking, but you may want to start by
reading the JobSubmitter class and work outward from there.
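
For context, JobSubmitter is what does the work behind the scenes when a
driver like the rough sketch below calls waitForCompletion()/submit();
the mapper/reducer and paths are placeholders you'd supply yourself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "example"); // Hadoop 2.x style; new Job(conf) on 1.x
    job.setJarByClass(ExampleDriver.class);
    job.setInputFormatClass(LoggingTextInputFormat.class); // the sketch above
    // job.setMapperClass(...); job.setReducerClass(...);  // your own classes
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() calls submit(), which hands off to JobSubmitter:
    // that is where getSplits() runs and the job artifacts get written to the FS.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}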

Does this help?

On Fri, Sep 7, 2012 at 1:24 PM, John Cuffney <cu...@gmail.com> wrote:
> Hey,
>
> Which class handles the top-level partitioning for MapReduce? I may be
> misunderstanding how this is handled, but as I picture it, there is a
> top-level controller that kicks off the whole process: it partitions the
> input and distributes the input segments to the various machines/tasks.
> I have been searching through many of the Job classes, and they all seem
> to handle a single task, whereas I need to do some work at the
> highest-level controller, if one exists. Any pointers on what I'm looking
> for, or on whether I'm on the wrong track, would be much appreciated.
>
> Thanks for the help,
> John



-- 
Harsh J
