Posted to user@hadoop.apache.org by Albert Chu <ch...@llnl.gov> on 2017/04/11 23:42:41 UTC

Disk full errors in local-dirs, what data is stored in yarn.nodemanager.local-dirs?

Hi,

I have a cluster that uses a parallel networked file system for our
major data storage, and each node has ~750G of local SSD space.  To
speed things up, we configure yarn.nodemanager.local-dirs to point at
the local SSD.
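
For reference, the relevant yarn-site.xml fragment looks roughly like
the following (a sketch: the property name is the standard one, the
path is the one from our logs below and would differ per cluster):

```xml
<!-- yarn-site.xml: point NodeManager local storage at the node-local SSD. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/l/ssd/achutest/localstore/yarn-nm</value>
</property>
```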

Recently, I've been trying to run a terasort of 2 terabytes of data over
8 nodes w/ Hadoop 2.7.3.  That's about 6000 gigs of local SSD space
available for caching, or 5400 gigs once Hadoop applies its 90%
disk-full check limit.

I always get disk-full errors such as the following when running:

2017-04-11 12:31:44,062 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /l/ssd/achutest/localstore/yarn-nm error, used space above threshold of 90.0%, removing from list of valid directories
2017-04-11 12:31:44,063 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /l/ssd/achutest/localstore/yarn-nm;
2017-04-11 12:31:44,063 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /l/ssd/achutest/localstore/yarn-nm;
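
As I understand it, the 90.0% in that warning comes from the
NodeManager's disk health checker, which can be tuned (raising it only
postpones the problem, but it's useful for experiments).  Roughly:

```xml
<!-- yarn-site.xml: utilization threshold at which the NodeManager marks
     a local-dir as bad (default 90.0 in Hadoop 2.7.x). -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
```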

What I don't understand is how I am getting disk-full errors.  Within
terasort, I should have at most 2000 gigs of mapped intermediate data
and at most 2000 gigs of merged data in reducers.  Even assuming some
overhead from Hadoop, I should have more than enough space for this
benchmark to complete given maps and reducers are spread out evenly
across nodes.

So my assumption is that something else is being stored in local-dirs
that I'm not accounting for.  Is there any other data I should consider
when coming up with my estimates?

One guess I had: is it possible that spilled data from reducer merges
is not deleted until a reducer completes?  If so, given my example
above, the total amount of merged data across reducers may exceed 2000
gigs at some point.
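
In case it helps anyone reproduce this, here is a rough way I'd check
where the local-dir space is going while a job runs.  The NodeManager
keeps usercache/ (per-user container data, including map spill/shuffle
output), filecache/ (public localized resources), and nmPrivate/
(container launch scripts and tokens) under each local-dir.  The mock
directory setup below just makes the commands runnable anywhere; on a
real node you'd point LOCAL_DIR at your actual local-dir instead:

```shell
# Stand-in for a real yarn.nodemanager.local-dirs entry, populated with
# a fake 1 MiB spill file so the commands below have something to report.
LOCAL_DIR=$(mktemp -d)
mkdir -p "$LOCAL_DIR"/usercache/alice/appcache/application_0001 \
         "$LOCAL_DIR"/filecache "$LOCAL_DIR"/nmPrivate
dd if=/dev/zero bs=1024 count=1024 2>/dev/null \
   of="$LOCAL_DIR"/usercache/alice/appcache/application_0001/spill0.out

# Per-subtree totals: run against each configured local-dir on a real node.
du -sk "$LOCAL_DIR"/usercache "$LOCAL_DIR"/filecache "$LOCAL_DIR"/nmPrivate

# Largest per-application consumers under usercache.
du -sk "$LOCAL_DIR"/usercache/*/appcache/* | sort -n | tail -5

rm -rf "$LOCAL_DIR"
```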

Al

-- 
Albert Chu
chu11@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: Disk full errors in local-dirs, what data is stored in yarn.nodemanager.local-dirs?

Posted by Sidharth Kumar <si...@gmail.com>.
Hi,

Can you paste the output of the "df -h" command here?

Regards
Sidharth

On Wednesday, April 12, 2017, Albert Chu <ch...@llnl.gov> wrote:

> [original message quoted above]

-- 
Regards
Sidharth Kumar | Mob: +91 8197 555 599 | LinkedIn
<https://www.linkedin.com/in/sidharthkumar2792/>