Posted to common-dev@hadoop.apache.org by "Marco Nicosia (JIRA)" <ji...@apache.org> on 2008/09/08 18:11:44 UTC

[jira] Created: (HADOOP-4115) Reducer gets stuck in shuffle when local disk out of space

Reducer gets stuck in shuffle when local disk out of space
----------------------------------------------------------

                 Key: HADOOP-4115
                 URL: https://issues.apache.org/jira/browse/HADOOP-4115
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.17.2
            Reporter: Marco Nicosia


2008-08-29 23:53:12,357 WARN org.apache.hadoop.mapred.ReduceTask: task_200808291851_0001_r_000245_0 Merging of the local FS files threw an exception: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:339)
	at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
	at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
	at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
	at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
	at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.io.SequenceFile$UncompressedBytes.writeUncompressedBytes(SequenceFile.java:617)
	at org.apache.hadoop.io.SequenceFile$Writer.appendRaw(SequenceFile.java:1038)
	at org.apache.hadoop.io.SequenceFile$Sorter.writeFile(SequenceFile.java:2626)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:1564)
Caused by: java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:260)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
	... 16 more

2008-08-29 23:53:14,013 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: task_200808291851_0001_r_000245_0The reduce copier failed
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4115) Reducer gets stuck in shuffle when local disk out of space

Posted by "Marco Nicosia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Nicosia updated HADOOP-4115:
----------------------------------

    Priority: Critical  (was: Major)



[jira] Commented: (HADOOP-4115) Reducer gets stuck in shuffle when local disk out of space

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629259#action_12629259 ] 

Devaraj Das commented on HADOOP-4115:
-------------------------------------

Ideally, the task should have exited when this exception was thrown. By any chance, did you get a jstack dump of the task when it hung after the exception? (The output of kill -3 <pid> would also help.) At this point, one suspect is that some non-daemon thread is preventing the process from exiting. The other suspect is that the task JVM is stuck for some reason in the finally clause of TaskTracker.Child.main(). Do you know whether the TT was reachable and whether it logged the task failure?
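
To illustrate the first suspect: a JVM does not exit while any non-daemon thread is still alive, even after main() has returned or thrown. The following is a minimal sketch, not code from Hadoop, of how a process could list its own live non-daemon threads, which is roughly the information one would look for in a jstack dump. The class name, method names, and thread names are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class NonDaemonThreadCheck {

    /** Returns the names of live non-daemon threads, excluding the calling thread. */
    static List<String> liveNonDaemonThreads() {
        List<String> names = new ArrayList<>();
        // Thread.getAllStackTraces() returns a map keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    /** Starts a thread that blocks until the latch is released (hypothetical helper). */
    static Thread startWorker(String name, boolean daemon, CountDownLatch release) {
        Thread t = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(daemon); // must be set before start()
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        // A non-daemon thread like this one would keep the JVM alive if never released.
        Thread blocker = startWorker("hypothetical-merger", false, release);
        // A daemon thread does not prevent the JVM from exiting.
        Thread background = startWorker("hypothetical-poller", true, release);

        System.out.println("Non-daemon threads still alive: " + liveNonDaemonThreads());

        // Release both workers so this demo terminates.
        release.countDown();
        blocker.join();
        background.join();
    }
}
```

If a list like this is non-empty after the task has reported its failure, those threads are candidates for what is holding the process open. Whether that is what happened in this report is not established here.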



[jira] Commented: (HADOOP-4115) Reducer gets stuck in shuffle when local disk out of space

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629293#action_12629293 ] 

Chris Douglas commented on HADOOP-4115:
---------------------------------------

Is there space available on other drives on that TT that aren't being used, or are all configured drives completely out of space? Does the reduce eventually fail and get rescheduled or does it hang? In the latter case, is the task ever rescheduled/speculated or does this state persist until the job is killed? In the former case, is it being rescheduled on the same node, ultimately and incorrectly failing the job, or does the job eventually succeed?
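
As a rough aid for answering the first question, here is a minimal sketch, not part of Hadoop, that reports usable space for each directory in a comma-separated list, such as the value one might copy from the TaskTracker's local directory setting. The class and method names are hypothetical, and java.io.File.getUsableSpace() requires Java 6 or later.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

public class LocalDirSpace {

    /** Maps each directory in a comma-separated list to its usable bytes. */
    static Map<String, Long> usableBytes(String commaSeparatedDirs) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String dir : commaSeparatedDirs.split(",")) {
            String trimmed = dir.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty entries such as a trailing comma
            }
            // Per the Javadoc, getUsableSpace() returns 0 if the path
            // does not name a partition.
            result.put(trimmed, new File(trimmed).getUsableSpace());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the directory list is passed as the first argument;
        // the JVM temp directory is used only as a stand-in default.
        String dirs = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        for (Map.Entry<String, Long> e : usableBytes(dirs).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " bytes usable");
        }
    }
}
```

Running this on the affected node with the configured directories would show whether only one drive is full or all of them are.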

Quick aside: it would help a lot if the issue description presented an abstract of the observed behavior; stack traces and other verbose diagnostic information are more readable (especially by email) in a comment.



[jira] Resolved: (HADOOP-4115) Reducer gets stuck in shuffle when local disk out of space

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley resolved HADOOP-4115.
---------------------------------

    Resolution: Duplicate

Runping, you marked this as a duplicate of another, more active bug. I'm resolving this one.
