Posted to common-user@hadoop.apache.org by amit handa <am...@gmail.com> on 2009/04/07 16:54:18 UTC

Handling Non Homogenous tasks via Hadoop

Hi,

Is there a way to control the number of tasks that can be spawned on a
machine, based on the machine's capacity and on how loaded it already is?

My use case is as follows:

I have to perform task1, task2, task3, ..., taskN.
These tasks have varied CPU and memory usage patterns.
All tasks of type task1 and task3 can take 80-90% CPU and 800 MB of RAM,
while tasks of type task2 take only 1-2% CPU and 5-10 MB of RAM.

How do I model this using Hadoop? Can I use a single cluster to run all
of these task types, or should I use a separate Hadoop cluster for each
task type? If the latter, how do I share data between the tasks (the
data can range from a few MB to a few GB)?

Please suggest an approach, or point me to any docs I can dig into.

Thanks,
Amit

Re: Handling Non Homogenous tasks via Hadoop

Posted by amit handa <am...@gmail.com>.
Thanks Aaron.

On Wed, Apr 8, 2009 at 12:00 AM, Aaron Kimball <aa...@cloudera.com> wrote:

> Amit,
>
> The mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum properties can be set on a
> per-host basis in each node's hadoop-site.xml file. With these, you can
> configure nodes with more or fewer cores, RAM, etc. to take on varying
> amounts of work.
>
> There is currently no mechanism, though, that feeds real-time machine
> utilization back to the task scheduler.
>
> - Aaron
>
>
> On Tue, Apr 7, 2009 at 7:54 AM, amit handa <am...@gmail.com> wrote:
>
> > Hi,
> >
> > Is there a way to control the number of tasks that can be spawned on a
> > machine, based on the machine's capacity and on how loaded it already
> > is?
> >
> > My use case is as follows:
> >
> > I have to perform task1, task2, task3, ..., taskN.
> > These tasks have varied CPU and memory usage patterns.
> > All tasks of type task1 and task3 can take 80-90% CPU and 800 MB of RAM,
> > while tasks of type task2 take only 1-2% CPU and 5-10 MB of RAM.
> >
> > How do I model this using Hadoop? Can I use a single cluster to run all
> > of these task types, or should I use a separate Hadoop cluster for each
> > task type? If the latter, how do I share data between the tasks (the
> > data can range from a few MB to a few GB)?
> >
> > Please suggest an approach, or point me to any docs I can dig into.
> >
> > Thanks,
> > Amit
> >
>

Re: Handling Non Homogenous tasks via Hadoop

Posted by Aaron Kimball <aa...@cloudera.com>.
Amit,

The mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum properties can be set on a
per-host basis in each node's hadoop-site.xml file. With these, you can
configure nodes with more or fewer cores, RAM, etc. to take on varying
amounts of work.
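
For example, on a larger node the relevant hadoop-site.xml entries might
look like the sketch below. The slot counts are purely illustrative, not
recommendations; tune them to each node's cores and RAM:

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml on a larger node: allow more concurrent task slots.
     The values here are illustrative examples only. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
```

A smaller node would ship the same file with lower values; each
tasktracker reads its own copy at startup.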

There is currently no mechanism, though, that feeds real-time machine
utilization back to the task scheduler.

- Aaron


On Tue, Apr 7, 2009 at 7:54 AM, amit handa <am...@gmail.com> wrote:

> Hi,
>
> Is there a way to control the number of tasks that can be spawned on a
> machine, based on the machine's capacity and on how loaded it already
> is?
>
> My use case is as follows:
>
> I have to perform task1, task2, task3, ..., taskN.
> These tasks have varied CPU and memory usage patterns.
> All tasks of type task1 and task3 can take 80-90% CPU and 800 MB of RAM,
> while tasks of type task2 take only 1-2% CPU and 5-10 MB of RAM.
>
> How do I model this using Hadoop? Can I use a single cluster to run all
> of these task types, or should I use a separate Hadoop cluster for each
> task type? If the latter, how do I share data between the tasks (the
> data can range from a few MB to a few GB)?
>
> Please suggest an approach, or point me to any docs I can dig into.
>
> Thanks,
> Amit
>