Posted to mapreduce-user@hadoop.apache.org by Debbie Fu <fu...@gmail.com> on 2011/01/03 13:28:07 UTC

large intermediate outputs

Hi,
Is it possible for the intermediate output to be too large to
store on the local disk?
If so, what does Hadoop do to handle the problem?
Thanks.

-- 
Best regards!

Re: large intermediate outputs

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 3, 2011, at 5:11 AM, Debbie Fu wrote:

> I think it will cause a disk fill-up too. Is there any mechanism in Hadoop
> that handles this situation?

	Not in a way that saves the job.

> If my local disks store too much block data
> and leave little space for intermediate output, and every node is in this
> situation so that the task cannot be scheduled on another node that has
> space for the intermediate output, what does Hadoop do? Does the job
> simply fail?

	Yes.

> Can I set a remote disk in mapred.local.dir?

	You can point it to an NFS mount, but that'd be suicide.

	Your best bet is to break the job up into multiple jobs, or to reduce the input per task, depending upon the situation, if using compression as Harsh mentioned is not acceptable.
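For the "reduce the input per task" option, one way to do it (a sketch only; the property name follows the older mapred.* naming and may differ in your release, and the 128 MB value is just an illustration) is to cap the input split size so each map task handles less data and therefore spills less intermediate output:

```xml
<!-- Sketch: limit the input per map task by capping the split size.
     Smaller splits mean more map tasks, each producing less
     intermediate output on any one node's local disk. -->
<property>
  <name>mapred.max.split.size</name>
  <value>134217728</value> <!-- 128 MB per split; tune to your data -->
</property>
```

Whether this helps depends on how the intermediate data is distributed; if one partition is inherently huge, smaller splits alone will not save you.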

Re: large intermediate outputs

Posted by Debbie Fu <fu...@gmail.com>.
I think it will cause a disk fill-up too. Is there any mechanism in Hadoop
that handles this situation? If my local disks store too much block data
and leave little space for intermediate output, and every node is in this
situation so that the task cannot be scheduled on another node that has
space for the intermediate output, what does Hadoop do? Does the job
simply fail? Can I set a remote disk in mapred.local.dir?

2011/1/3 Harsh J <qw...@gmail.com>

> Additionally, you can set mapred.local.dir to be a comma-separated
> list of paths that reside on multiple disks -- this spreads I/O plus
> gives you additional space.
>
> But I suppose if a single Mapper is writing a huge amount of data for
> a single partition output, it may cause a disk fill-up. Please correct
> me if I am wrong here.
>
> On Mon, Jan 3, 2011 at 5:58 PM, Debbie Fu <fu...@gmail.com> wrote:
> > Hi,
> > Is it possible for the intermediate output to be too large to
> > store on the local disk?
> > If so, what does Hadoop do to handle the problem?
> > Thanks.
> >
> > --
> > Best regards!
> >
> >
>
>
>
> --
> Harsh J
> www.harshj.com
>



-- 
Best regards!

Yulin Fu
SUN YAT-SEN UNIVERSITY
Mobile:13570409599
QQ:642786040

Re: large intermediate outputs

Posted by Harsh J <qw...@gmail.com>.
Additionally, you can set mapred.local.dir to be a comma-separated
list of paths that reside on multiple disks -- this spreads I/O plus
gives you additional space.
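As a sketch, such an entry in mapred-site.xml might look like the following (the paths are illustrative; each should sit on a separate physical disk):

```xml
<!-- Sketch: spread intermediate map output across several disks.
     Example paths only -- substitute the mount points on your nodes. -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
</property>
```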

But I suppose if a single Mapper is writing a huge amount of data for
a single partition output, it may cause a disk fill-up. Please correct
me if I am wrong here.

On Mon, Jan 3, 2011 at 5:58 PM, Debbie Fu <fu...@gmail.com> wrote:
> Hi,
> Is it possible for the intermediate output to be too large to
> store on the local disk?
> If so, what does Hadoop do to handle the problem?
> Thanks.
>
> --
> Best regards!
>
>



-- 
Harsh J
www.harshj.com

Re: large intermediate outputs

Posted by Ravi Gummadi <gr...@yahoo-inc.com>.
The following two measures could solve the issue to some extent, but neither
is applied automatically by Hadoop; the user needs to set them before
submitting the job.
(1) Enable map output compression using the configuration property
mapreduce.map.output.compress.
(2) Use a combiner so that the mapper possibly emits less intermediate
data.

-Ravi

On 1/3/11 5:58 PM, "Debbie Fu" <fu...@gmail.com> wrote:

> Hi,
> Is it possible for the intermediate output to be too large to
> store on the local disk?
> If so, what does Hadoop do to handle the problem?
> Thanks.