Posted to common-dev@hadoop.apache.org by Rod Taylor <rb...@sitesell.com> on 2006/05/21 20:22:14 UTC

Re: Job scheduling (Re: Unable to run more than one job concurrently)

> > (2) Have a per-job total task count limit. Currently, we establish the
> > number of tasks each node runs, and how many map or reduce tasks we have
> > total in a given job. But it would be great if we could set a ceiling on the
> > number of tasks that run concurrently for a given job. This may help with
> > Andrzej's fetcher (since it is bandwidth constrained, maybe fewer concurrent
> > jobs would be fine?).
> 
> I like this idea.  So if the highest-priority job is already running at 
> its task limit, then tasks can be run from the next highest-priority 
> job.  Should there be separate limits for maps and reduces?

Limits for map and reduce are useful for a job class, not so much for
a specific job instance.  Data collection may be best achieved with 15
parallel maps pulling data from remote data sources, but whether 3 of
them come from one job and 12 from another isn't important. What is
important is that the 15 together make the best use of resources.
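
Roughly the semantics I have in mind, as a standalone Java sketch (not
Hadoop code; all the names and numbers are made up for illustration):

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class ClassCeilingSketch {

    static class Job {
        final Queue<String> pendingMaps = new ArrayDeque<String>();
        Job(String name, int maps) {
            for (int i = 0; i < maps; i++) {
                pendingMaps.add(name + "-map-" + i);
            }
        }
    }

    public static void main(String[] args) {
        final int classCeiling = 15;   // shared by the whole job class
        int runningInClass = 3;        // e.g. 3 maps of job A already running

        List<Job> jobsInClass = Arrays.asList(new Job("A", 5), new Job("B", 40));

        // Hand out map slots round-robin until the class-wide ceiling is
        // reached; which job supplies each task does not matter, only the
        // total of 15 does.
        boolean launched = true;
        while (runningInClass < classCeiling && launched) {
            launched = false;
            for (Job job : jobsInClass) {
                if (runningInClass >= classCeiling) break;
                String task = job.pendingMaps.poll();
                if (task != null) {
                    System.out.println("launch " + task);
                    runningInClass++;
                    launched = true;
                }
            }
        }
        System.out.println("maps running in class: " + runningInClass);
    }
}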

A different priority for map and reduce would also be useful. Data
collection within a set timeframe is often far more important than
reducing the collected data for storage or post-processing,
particularly when collection means retrieving it from a remote
resource.


Data warehousing activities often require that data collection occur
once a night between set hours (very high priority), while processing
of the collected data can happen any time before the end of the
quarter.
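
Sketching the same idea (again standalone, with invented names): give
each job separate map and reduce priorities, and have the scheduler
pick the runnable task whose priority for its own phase is highest:

import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PhasePrioritySketch {

    enum Phase { MAP, REDUCE }

    static class Task {
        final String job;
        final Phase phase;
        final int mapPriority;      // higher number = more urgent
        final int reducePriority;
        Task(String job, Phase phase, int mapPriority, int reducePriority) {
            this.job = job;
            this.phase = phase;
            this.mapPriority = mapPriority;
            this.reducePriority = reducePriority;
        }
        int effectivePriority() {
            return phase == Phase.MAP ? mapPriority : reducePriority;
        }
    }

    public static void main(String[] args) {
        // Nightly collection job: its maps are urgent, its reduces can wait
        // until the end of the quarter. Post-processing job: low either way.
        List<Task> runnable = Arrays.asList(
            new Task("collect", Phase.MAP, 100, 1),
            new Task("collect", Phase.REDUCE, 100, 1),
            new Task("postproc", Phase.MAP, 10, 10));

        Task next = Collections.max(runnable,
            Comparator.comparingInt(Task::effectivePriority));

        // Prints "collect MAP": the collection map outranks both its own
        // reduce and the post-processing work.
        System.out.println(next.job + " " + next.phase);
    }
}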


For Nutch, with both of the above you should be able to keep N fetch
map tasks running at all times, with everything else being secondary
within the remaining resources. That could make use of 100% of the
available remote bandwidth.

-- 
Rod Taylor <rb...@sitesell.com>


Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Rod Taylor <rb...@sitesell.com>.
> You have no guarantee that your time-sensitive data is safe/committed
> until after your reduce has completed.  If you care about reliability
> or data integrity, simply run a full map-reduce job in your collection
> window and store the result in HDFS.

Perhaps I explained it poorly. It's NOT the data that is time
sensitive, it is the resource availability: retrieval is only possible
within a given window. And so long as sorting is a requirement of
reduce, the overhead of saving that way is going to remain
significant.

> Do the expensive post-processing, which you have a quarter to
> complete, as another job.  Being able to preempt a long job with a
> time-sensitive short job seems to really be your requirement.

Fetch has the same problem. Even running fetches end-to-end (starting
a new one the instant the previous one finishes), you end up with
lulls between fetches. For me this is about 15% of the time (15%
wasted bandwidth, since I pay a flat rate).

My machines all have 12GB of RAM -- temporary storage is in memory --
and reasonably fast processors. I really don't want to hold up a new
fetch map for a previous round's fetch reduce.

-- 
Rod Taylor <rb...@sitesell.com>


Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
??

You have no guarantee that your time-sensitive data is safe/committed
until after your reduce has completed.  If you care about reliability
or data integrity, simply run a full map-reduce job in your collection
window and store the result in HDFS.
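
A minimal sketch of that, written against the classic
org.apache.hadoop.mapred API (the class and job names are made up, and
the method names are the ones from later stable releases rather than
necessarily what the current tree has):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CollectAndStore {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CollectAndStore.class);
        conf.setJobName("nightly-collect");

        // Identity map/reduce here; a real collection job would fetch
        // remote data in the map. The point is that the data is not
        // considered committed until the reduce output lands in HDFS.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Blocks until the job completes and the output is committed.
        JobClient.runJob(conf);
    }
}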

Do the expensive post-processing, which you have a quarter to
complete, as another job.  Being able to preempt a long job with a
time-sensitive short job seems to really be your requirement.
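
And a standalone sketch of that preemption idea (again not real Hadoop
code; names and numbers invented): when a time-sensitive job arrives
and every slot is busy, the least important running task gives up its
slot, and map-reduce simply re-runs it later:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PreemptionSketch {

    static class RunningTask {
        final String job;
        final int priority;         // higher number = more important
        RunningTask(String job, int priority) {
            this.job = job;
            this.priority = priority;
        }
    }

    public static void main(String[] args) {
        final int slots = 4;
        List<RunningTask> running = new ArrayList<RunningTask>(Arrays.asList(
            new RunningTask("quarterly-postproc", 1),
            new RunningTask("quarterly-postproc", 1),
            new RunningTask("quarterly-postproc", 1),
            new RunningTask("quarterly-postproc", 1)));

        // A short, time-sensitive collection job arrives and needs a slot.
        int incomingPriority = 100;

        // If every slot is busy, free one by preempting the least important
        // running task, but only if it really is less important.
        if (running.size() >= slots) {
            RunningTask victim = Collections.min(running,
                Comparator.comparingInt((RunningTask t) -> t.priority));
            if (victim.priority < incomingPriority) {
                running.remove(victim);   // killed; it will be re-run later
                System.out.println("preempted a task of " + victim.job);
            }
        }
        if (running.size() < slots) {
            running.add(new RunningTask("nightly-collect", incomingPriority));
        }
        System.out.println("tasks now running: " + running.size());
    }
}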
