Posted to mapreduce-issues@hadoop.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2014/07/29 22:52:39 UTC

[jira] [Comment Edited] (MAPREDUCE-1296) Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even though other disks still have space.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790916#comment-12790916 ] 

Allen Wittenauer edited comment on MAPREDUCE-1296 at 7/29/14 8:51 PM:
----------------------------------------------------------------------

bq. Should we move this discussion to an HDFS ticket?

Honestly, there is no point in discussing the issues around reservation.  It is just one of those topics I've thrown my hands up on and have opted for a workaround (dedicated mapred and hdfs filesystems).  There is an ancient JIRA where I think this already took place anyway.
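
A minimal sketch of what "dedicated mapred and hdfs filesystems" can look like at the configuration level, using the 0.20-era property names; the mount paths below are hypothetical, not any particular cluster's layout:

    // Sketch only: keep MapReduce local directories and DataNode data
    // directories on separate mounts, so a full mapred spill area cannot
    // exhaust HDFS space (or vice versa). Paths are made up.
    import org.apache.hadoop.conf.Configuration;

    public class DedicatedFilesystems {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // intermediate map output, spills, distributed cache, etc.
        conf.set("mapred.local.dir", "/grid/mapred/0,/grid/mapred/1");
        // HDFS block storage on entirely different mounts
        conf.set("dfs.data.dir", "/grid/hdfs/0,/grid/hdfs/1");
        System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
        System.out.println("dfs.data.dir     = " + conf.get("dfs.data.dir"));
      }
    }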

bq. I think it's a bit different than that one - in the case of "out of space", you don't want to blacklist the volume, since as you said, the space usage is fluctuating and if a disk has been out of space once, it isn't the case that it will always be so.

I was thinking in a more general sense (bad disk=woops, we do the wrong thing). :)  But, yes, of course you are correct here.  
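
A rough sketch of the kind of selection logic this implies: re-check free space at allocation time and skip a volume that happens to be full right now, rather than marking it bad permanently. The class and method names are made up for illustration, not the real TaskTracker/allocator code:

    import java.io.File;

    // Illustrative only: pick a local directory that currently has enough
    // free space, skipping (but not blacklisting) volumes that are full.
    public class SpaceAwareDirPicker {
      private final File[] dirs;
      private int next = 0;  // round-robin cursor

      public SpaceAwareDirPicker(File[] dirs) {
        this.dirs = dirs;
      }

      /** Return a directory with at least requiredBytes free, or null if none. */
      public synchronized File pick(long requiredBytes) {
        for (int i = 0; i < dirs.length; i++) {
          File candidate = dirs[(next + i) % dirs.length];
          // getUsableSpace() is re-checked on every call, so a volume that
          // was full a moment ago is usable again as soon as space frees up.
          if (candidate.getUsableSpace() > requiredBytes) {
            next = (next + i + 1) % dirs.length;
            return candidate;
          }
        }
        return null;  // every volume is currently out of space
      }
    }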

We should probably add some edge-case testing (what happens to process X when a disk is full) to help spot these things.
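
Something along these lines, sketched against the hypothetical picker above (JUnit 4 style; it does not exercise the real TaskTracker code paths):

    import static org.junit.Assert.assertNotNull;
    import static org.junit.Assert.assertNull;

    import java.io.File;
    import org.junit.Test;

    // Edge-case sketch: a "disk full" outcome should be reported, and it
    // should not permanently disable the directory for later requests.
    public class TestDiskFullBehaviour {
      @Test
      public void fullVolumeIsSkippedButNotDisabled() {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        SpaceAwareDirPicker picker = new SpaceAwareDirPicker(new File[] { dir });

        // Asking for more space than any volume has should fail loudly
        // (null), not silently write to a full disk.
        assertNull(picker.pick(Long.MAX_VALUE));

        // A modest request on the same volume should still succeed, i.e.
        // the earlier "full" result did not blacklist the directory.
        assertNotNull(picker.pick(1024L));
      }
    }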


was (Author: aw):
bg. Should we move this discussion to an HDFS ticket?

Honestly, there is no point in discussing the issues around reservation.  It is just one of those topics I've thrown my hands up on and have opted for a workaround (dedicated mapred and hdfs filesystems).  There is an ancient JIRA where I think this already took place anyway.

bg. I think it's a bit different than that one - in the case of "out of space", you don't want to blacklist the volume, since as you said, the space usage is fluctuating and if a disk has been out of space once, it isn't the case that it will always be so.

I was thinking in a more general sense (bad disk=woops, we do the wrong thing). :)  But, yes, of course you are correct here.  

We should probably add some edge case testing: what happens to process X when disk is full to help spot these things.

> Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even though other disks still have space.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1296
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1296
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: capacity-sched
>    Affects Versions: 0.20.2
>            Reporter: Iyappan Srinivasan
>
> Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even though other disks still have space.
> In a cluster, data is distributed almost uniformly. Disk /grid/0/ reaches 100% first because it also fills up with extra data such as logs. After it reaches 100%, tasks start to fail with the error:
> java.lang.Throwable: Child Error
> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:516)
> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:503)
> This happens even though the other disks are only at about 80% and could still hold more data.
> Steps to reproduce:
> 1) Bring up a cluster with the Linux task controller.
> 2) Start filling up DFS with data using randomwriter or teragen.
> 3) Once the first disk reaches 100%, tasks start to fail.



--
This message was sent by Atlassian JIRA
(v6.2#6252)