Posted to mapreduce-user@hadoop.apache.org by Matt Steele <rm...@gmail.com> on 2011/09/22 00:45:10 UTC

quotas for size of intermediate map/reduce output?

Hi All,

Is it possible to enforce a maximum on the disk space consumed by a
map/reduce job's intermediate output?  It looks like you can impose limits
on HDFS consumption, or, via the capacity scheduler, limits on the RAM that
a map/reduce slot uses, or on the number of slots used.
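
For concreteness, the HDFS-side quotas I mean are the dfsadmin ones — a
sketch, with the directory name and size purely illustrative:

```shell
# Cap HDFS usage for a user directory at 1 TB. Note this limits HDFS
# storage only, not the local-disk intermediate output I'm asking about.
hadoop dfsadmin -setSpaceQuota 1t /user/alice

# Inspect the quota and remaining space for that directory.
hadoop fs -count -q /user/alice

# Remove the quota again.
hadoop dfsadmin -clrSpaceQuota /user/alice
```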

But if I'm worried that a job might exhaust the cluster's disk capacity
during the shuffle, my sense is that I'd have to quarantine the job on a
separate cluster.  Am I wrong?  Do you have any suggestions for me?

Thanks,
Matt

Re: quotas for size of intermediate map/reduce output?

Posted by Matt Steele <rm...@gmail.com>.
Thanks for this info; it sounds like we should upgrade to 0.20.204.

If more than one job is running when the cluster loses its ability to
schedule new tasks due to insufficient disk space, do you know what logic
the jobtracker uses to decide which job to kill?

-Matt

On Wed, Sep 21, 2011 at 4:36 PM, Arun C Murthy <ac...@hortonworks.com> wrote:

> We do track intermediate output used and if a job is using too much and
> can't be scheduled anywhere on a cluster the CS/JT will fail it. You'll need
> hadoop-0.20.204 for this though.
>
> Also, with MRv2 we are in the process of adding limits on disk usage for
> intermediate outputs, logs etc.
>
> hth,
> Arun

Re: quotas for size of intermediate map/reduce output?

Posted by Arun C Murthy <ac...@hortonworks.com>.
We do track the intermediate output a job uses, and if a job uses too much to be scheduled anywhere on the cluster, the CapacityScheduler/JobTracker (CS/JT) will fail it. You'll need hadoop-0.20.204 for this, though.

Also, with MRv2 we are in the process of adding limits on disk usage for intermediate outputs, logs etc.
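
In the meantime you can at least eyeball intermediate-output usage per node
by hand — a rough sketch, assuming the default local directory layout (check
mapred.local.dir in your mapred-site.xml for the real path):

```shell
# Summarize disk used by map output spills on a tasktracker node.
# /tmp/hadoop-${USER} is only the default hadoop.tmp.dir; yours may differ.
du -sh /tmp/hadoop-${USER}/mapred/local
```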

hth,
Arun
