
Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/12/04 18:49:58 UTC

[jira] [Created] (YARN-257) NM should gracefully handle a full local disk

Jason Lowe created YARN-257:
-------------------------------

             Summary: NM should gracefully handle a full local disk
                 Key: YARN-257
                 URL: https://issues.apache.org/jira/browse/YARN-257
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 0.23.5, 2.0.2-alpha
            Reporter: Jason Lowe


When a local disk becomes full, the node fails every container launched on it because the container is unable to localize.  Localization tries to create an app-specific directory in each of the local and log directories.  If any of those directory creations fails (for example, due to lack of free space), the container fails.

It would be nice if the node could continue to launch containers using the space available on other disks rather than failing all containers trying to launch on the node.

This is somewhat related to YARN-91 but is centered around the disk becoming full rather than the disk failing.
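
To make the proposal concrete, here is a minimal sketch of the idea of skipping full disks. It is not NodeManager code: the class name LocalDirSelector, the method dirsWithFreeSpace, and the 1 GiB threshold are all hypothetical, and it assumes that checking java.io.File.getUsableSpace() is an acceptable way to detect a full disk.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the NodeManager.
class LocalDirSelector {

    /** Returns the candidate directories that report at least minFreeBytes of usable space. */
    static List<File> dirsWithFreeSpace(List<File> candidates, long minFreeBytes) {
        List<File> usable = new ArrayList<>();
        for (File dir : candidates) {
            // Skip anything that is not an existing directory, and any
            // directory whose file system is below the free-space threshold.
            if (dir.isDirectory() && dir.getUsableSpace() >= minFreeBytes) {
                usable.add(dir);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        List<File> dirs = new ArrayList<>();
        for (String path : args) {
            dirs.add(new File(path));
        }
        long minFreeBytes = 1024L * 1024L * 1024L; // 1 GiB, an arbitrary assumption
        List<File> usable = dirsWithFreeSpace(dirs, minFreeBytes);
        if (usable.isEmpty()) {
            System.out.println("No local dir has enough space; the launch would have to fail");
        } else {
            System.out.println("Dirs still usable for new containers: " + usable);
        }
    }
}
```

With a filter like this, a container launch would fail only when every configured directory is full, rather than when any single one is.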

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510812#comment-13510812 ] 

Bikas Saha commented on YARN-257:
---------------------------------

Before the complete change, would it help if the NM did not accept new containers, perhaps by indicating in its heartbeat that no containers should be assigned to it?
Why does the RM not notice abnormal failure rates on such an NM and put it out of rotation for scheduling?
                

[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510842#comment-13510842 ] 

Jason Lowe commented on YARN-257:
---------------------------------

bq. Before the complete change, would it help if the NM did not accept new containers, perhaps by indicating in its heartbeat that no containers should be assigned to it?

Yes, it would be nice sometimes if a node could declare itself as being UNHEALTHY without causing all containers currently running to be shot as it does now.  Sort of a "let's drain the currently running containers but not allow any new ones" mode.

bq. Why does the RM not notice abnormal failure rates on such an NM and put it out of rotation for scheduling?

Currently the RM doesn't track container failures on nodes for purposes of blacklisting them.  AFAIK nodes can only be blacklisted by the RM by declaring themselves UNHEALTHY via the health checker script that they run.  The MR AM is already tracking such things, but I don't believe there's a feedback mechanism from the AM to the RM to help the RM figure out which nodes are bad from an AM's perspective.  Might be nice to have, and YARN-195 covers this to some extent.

As you indicate, the RM could also check container failures solely via container status from the NMs and blacklist NMs based on some algorithm.  We need to be careful that a misconfigured large job doesn't end up blacklisting a large chunk of the cluster because all of its containers fail.  Think bad parameters on mapreduce.map.java.opts, for example, or a case where it doesn't get the classpath for its tasks correct.  And not all container failures from an AM's point of view are visible to the RM watching container status.  The container could exit cleanly but still fail at the app level, for example.  So we might need both mechanisms.
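
As a rough illustration of the kind of algorithm that would avoid the misconfigured-job problem, the sketch below blacklists a node only after containers from several distinct applications have failed on it. Everything here is hypothetical (the class NodeFailureTracker, its methods, and the distinct-application threshold); the RM has no such tracker today, and this is just one possible heuristic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not existing ResourceManager code.
class NodeFailureTracker {
    private final int minDistinctApps;
    // For each node, the set of applications that have had a container fail there.
    private final Map<String, Set<String>> failedAppsByNode = new HashMap<>();

    NodeFailureTracker(int minDistinctApps) {
        this.minDistinctApps = minDistinctApps;
    }

    /** Records that a container of appId failed on nodeId. */
    void recordFailure(String nodeId, String appId) {
        failedAppsByNode.computeIfAbsent(nodeId, k -> new HashSet<>()).add(appId);
    }

    /**
     * A node is a blacklist candidate only if failures came from at least
     * minDistinctApps different applications, so that a single broken job
     * cannot take out the nodes it happens to run on.
     */
    boolean shouldBlacklist(String nodeId) {
        Set<String> apps = failedAppsByNode.get(nodeId);
        return apps != null && apps.size() >= minDistinctApps;
    }

    public static void main(String[] args) {
        NodeFailureTracker tracker = new NodeFailureTracker(2);
        // One misconfigured job failing repeatedly on the same node.
        for (int i = 0; i < 100; i++) {
            tracker.recordFailure("node1", "appA");
        }
        System.out.println(tracker.shouldBlacklist("node1")); // false
        // A second, unrelated application also fails there.
        tracker.recordFailure("node1", "appB");
        System.out.println(tracker.shouldBlacklist("node1")); // true
    }
}
```

This only covers failures visible through container status; app-level failures of containers that exit cleanly would still need feedback from the AM.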

                