Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2008/09/12 00:03:44 UTC

[jira] Created: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

If a reducer failed at shuffling stage, the task should fail, not just logging an exception
-------------------------------------------------------------------------------------------

                 Key: HADOOP-4163
                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.17.1
            Reporter: Runping Qi




I saw a reducer stuck at the shuffling stage, with the following exception logged in the log file:

2008-08-30 00:16:23,265 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
	at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
	at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:185)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
Caused by: java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:260)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
	... 11 more

2008-08-30 00:16:23,320 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: task_200808291851_0001_r_000023_0The reduce copier failed
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)

The task should have died.
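
To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern described above. It is illustrative only, not the 0.17.1 source: the class, field, and method names are stand-ins. The copy thread swallows every Throwable and merely logs it, so the reduce side never learns that the copy is dead and keeps waiting in the shuffle.

    import java.io.IOException;

    // Illustrative stand-in only (not Hadoop code): a copy thread that catches
    // Throwable and merely logs it, leaving the waiting thread stuck forever.
    public class StuckShuffleSketch {

      static volatile boolean copyDone = false;   // never set if the copy fails

      static void copyOutput() throws IOException {
        throw new IOException("No space left on device");  // simulate the disk-full failure
      }

      public static void main(String[] args) throws InterruptedException {
        Thread copier = new Thread(() -> {
          try {
            copyOutput();
            copyDone = true;
          } catch (Throwable t) {
            // Only logged -- nothing signals the waiting thread that the copy failed.
            System.err.println("Map output copy failure: " + t);
          }
        });
        copier.start();
        copier.join();

        // The "shuffle" wait below has no failure flag to check, so it spins
        // forever -- the "reducer stuck at the shuffling stage" symptom.
        while (!copyDone) {
          System.out.println("still shuffling...");
          Thread.sleep(1000);
        }
      }
    }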





[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated HADOOP-4163:
-----------------------------------

    Attachment: 4163_v2.patch

incorporated Chris' comments



[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635364#action_12635364 ] 

Hadoop QA commented on HADOOP-4163:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12391114/4163_v1.patch
  against trunk revision 700028.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/console

This message is automatically generated.



[jira] Issue Comment Edited: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634379#action_12634379 ] 

amareshwari edited comment on HADOOP-4163 at 9/24/08 11:03 PM:
---------------------------------------------------------------------------

Runping, Sorry for asking this late. Is it possible to get JobTracker and TaskTracker logs for the reduce task?

      was (Author: amareshwari):
    Runping, Sorry for asking this late. Is it possible to get JobTracker and TaskTracker for the reduce task?
  


[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4163:
----------------------------------

    Status: Open  (was: Patch Available)

* handleIfFSError(t) doesn't need to be called in contexts where mergeThrowable is set. Equivalent code should be called after ReduceCopier::fetchOutputs returns false.
* Code handling FSError should be in a catch block, not handled using instanceof in a method call from a catch of Throwable. The retry loop is unnecessary, and the call to System.exit is overly aggressive (i.e. handleIfFSError should not exist).
* Discarding map output cannot generate FSError and does not require handling.

This should be replaced with a catch of FSError before Throwable in MapOutputCopier::run that calls umbilical.fsError (if it throws, the exception can be logged and ignored). If reduceCopier.fetchOutputs returns false, then reduceCopier.mergeThrowable should be the cause of the thrown exception (it's OK if it's null). If mergeThrowable is FSError, it would be reasonable to call umbilical.fsError before the throw.
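
For concreteness, a rough, self-contained sketch of that structure follows. It is not the committed patch: FSErrorLike and Umbilical are local stand-ins for FSError and the umbilical protocol (whose exact signatures are not shown in this thread), and the surrounding method names are illustrative.

    import java.io.IOException;

    // Rough sketch only, not the committed patch. FSErrorLike and Umbilical are
    // local stand-ins for FSError and the umbilical protocol; the exact Hadoop
    // signatures are assumptions here.
    public class CopierErrorHandlingSketch {

      static class FSErrorLike extends Error {              // stand-in for FSError
        FSErrorLike(Throwable cause) { super(cause); }
      }

      interface Umbilical {                                  // stand-in for the umbilical
        void fsError(String taskAttemptId, String message) throws IOException;
      }

      static volatile Throwable mergeThrowable;              // set by copier/merge threads

      // Copier thread: catch the FSError stand-in *before* Throwable and report it.
      static void copierRun(Umbilical umbilical, String taskId) {
        try {
          copyOutput();
        } catch (FSErrorLike e) {
          try {
            umbilical.fsError(taskId, e.getMessage());       // tell the TaskTracker
          } catch (IOException io) {
            // If reporting itself fails, just log it and move on.
            System.err.println("could not report FSError: " + io);
          }
          mergeThrowable = e;
        } catch (Throwable t) {
          mergeThrowable = t;                                 // remembered for the main thread
        }
      }

      // After fetchOutputs() returns false: fail the task, chaining mergeThrowable
      // as the cause (a null cause is fine). If it is the FSError stand-in, it could
      // also be reported via the umbilical here before throwing.
      static void afterFetchOutputs(boolean fetchSucceeded, String taskId) throws IOException {
        if (!fetchSucceeded) {
          throw new IOException(taskId + " The reduce copier failed", mergeThrowable);
        }
      }

      static void copyOutput() { /* placeholder for the real copy loop */ }
    }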



[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637072#action_12637072 ] 

Hadoop QA commented on HADOOP-4163:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12391525/4163_v3.patch
  against trunk revision 701948.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/console

This message is automatically generated.



[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633214#action_12633214 ] 

Amareshwari Sriramadasu commented on HADOOP-4163:
-------------------------------------------------

Or is it the same as HADOOP-4115?



[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4163:
----------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Sharad



[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635798#action_12635798 ] 

Arun C Murthy commented on HADOOP-4163:
---------------------------------------

bq. But I am slightly apprehensive about the implication of this change this late in the game.. Thoughts ? 

I agree, we probably should fix that in 0.20.0.



[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated HADOOP-4163:
-----------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637277#action_12637277 ] 

Chris Douglas commented on HADOOP-4163:
---------------------------------------

+1



[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-4163:
--------------------------------

         Priority: Blocker  (was: Major)
    Fix Version/s: 0.19.0
         Assignee: Amareshwari Sriramadasu

Marking this as a blocker until we get to the root cause..



[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633207#action_12633207 ] 

Amareshwari Sriramadasu commented on HADOOP-4163:
-------------------------------------------------

Runping, can you give some information about the job?
1. What is the typical map runtime in your job? Is it less than 4 seconds?
2. What is the value of *mapred.reduce.copy.backoff* in your configuration?
3. Were there any maps re-executed because of "Too many fetch failures"?
4. The log you have attached has *Error running child* from the TaskTracker. So, did the reducer eventually die? If so, are you saying that the reducer spent a long time in the shuffle before it could die?
5. What happened to the job finally? Did it succeed?




[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated HADOOP-4163:
-----------------------------------

    Attachment: 4163_v3.patch

incorporated Chris' feedback



[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635736#action_12635736 ] 

Devaraj Das commented on HADOOP-4163:
-------------------------------------

Today, if the copier thread (ReduceTask.ReduceCopier.MapOutputCopier.run()) throws a Throwable, it is logged and ignored. I am wondering whether it makes sense to treat all exceptions except IOExceptions (mostly due to network issues) as fatal. Here is one thought -
Rename mergeThrowable to shuffleThrowable. In the copier thread, we could set shuffleThrowable when Throwable is caught (IOException is caught separately already). In all the places where mergeThrowable is set, we could set shuffleThrowable. The loop inside fetchOutputs could check whether shuffleThrowable is non-null.
When fetchOutputs returns with a 'false', we could check whether the shuffleThrowable is an instance of Error and if so, throw the Error out. In the other case, we could wrap it in an IOException. Doing it in the above way would mean that we call umbilical.fsError at exactly one place - in Child.main(). 
But I am slightly apprehensive about the implication of this change this late in the game.. Thoughts ?
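
In code form, the idea above might look roughly like the self-contained sketch below (illustrative names only, not actual Hadoop code): a single shuffleThrowable field checked by the fetch loop, Errors rethrown as-is so that the fsError reporting can live in one place in Child.main(), and everything else wrapped in an IOException.

    import java.io.IOException;

    // Self-contained sketch of the alternative floated above; names and structure
    // are illustrative, not actual Hadoop code.
    public class ShuffleThrowableSketch {

      static volatile Throwable shuffleThrowable;   // renamed from mergeThrowable

      // Copier thread: anything other than an IOException is recorded as fatal.
      static void copierRun() {
        try {
          copyOutput();
        } catch (IOException ioe) {
          // Transient fetch problems (e.g. network) would keep being retried elsewhere.
          System.err.println("copy failed, will retry: " + ioe);
        } catch (Throwable t) {
          shuffleThrowable = t;
        }
      }

      // Loop inside fetchOutputs(): bail out as soon as a fatal throwable appears.
      static boolean fetchOutputs() {
        while (!allCopiesDone()) {
          if (shuffleThrowable != null) {
            return false;
          }
          // ... schedule copies, wait for copiers, etc.
        }
        return true;
      }

      // Caller of fetchOutputs(): rethrow Errors (e.g. FSError) as-is so they can be
      // handled in one place; wrap anything else in an IOException.
      static void runShuffle() throws IOException {
        if (!fetchOutputs()) {
          if (shuffleThrowable instanceof Error) {
            throw (Error) shuffleThrowable;
          }
          throw new IOException("The reduce copier failed", shuffleThrowable);
        }
      }

      static void copyOutput() throws IOException { /* placeholder for the real copy */ }
      static boolean allCopiesDone() { return true; }
    }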




[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated HADOOP-4163:
-----------------------------------

    Attachment: 4163_v1.patch

Patch with the simple fix. It adds a check for FSError. If an FSError is encountered, the TT is notified about it, so that the TT can purge this task.

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635812#action_12635812 ] 

Devaraj Das commented on HADOOP-4163:
-------------------------------------

I agree with you, Chris and Arun. Let's do the code improvement later.

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch, 4163_v2.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637478#action_12637478 ] 

Hudson commented on HADOOP-4163:
--------------------------------

Integrated in Hadoop-trunk #626 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/626/])
HADOOP-4163. Report FSErrors from map output fetch threads instead of merely logging them. Contributed by Sharad Agarwal.


> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch, 4163_v2.patch, 4163_v3.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated HADOOP-4163:
-----------------------------------

    Status: Patch Available  (was: Open)

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch, 4163_v2.patch, 4163_v3.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637076#action_12637076 ] 

Sharad Agarwal commented on HADOOP-4163:
----------------------------------------

The test failure is unrelated.

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch, 4163_v2.patch, 4163_v3.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634379#action_12634379 ] 

Amareshwari Sriramadasu commented on HADOOP-4163:
-------------------------------------------------

Runping, sorry for asking this late. Is it possible to get the JobTracker and TaskTracker logs for the reduce task?

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das reassigned HADOOP-4163:
-----------------------------------

    Assignee: Sharad Agarwal  (was: Amareshwari Sriramadasu)

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634531#action_12634531 ] 

Runping Qi commented on HADOOP-4163:
------------------------------------


They are all gone.

I'll keep them next time.


> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635807#action_12635807 ] 

Chris Douglas commented on HADOOP-4163:
---------------------------------------

* Calling fsError with the message from the exception would probably be more useful
* Instead of rethrowing FSError, setting reduceCopier.mergeThrowable as the cause of the thrown IOE is both more polite and more useful when the merge fails (sketched below)
* The check for null before instanceof is [redundant|http://java.sun.com/docs/books/jls/third_edition/html/expressions.html#15.20.2]

bq. I am wondering whether it makes sense to treat all exceptions except IOExceptions (mostly due to network issues) as fatal [...]
If we ignore IOException, that leaves unchecked exceptions and Errors. Other than FSError, what do we expect, and why would we expect other errors from fetch threads to kill the task profitably? I think it would improve the structure of the code, but it seems risky for 0.19 unless we observe other errors that should kill the task and don't.
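
To make the second and third points concrete (a simplified sketch, not the committed patch; mergeThrowable follows the name used above, the rest is illustrative):

{code}
import java.io.IOException;

class CauseChainingSketch {

  // Failure recorded by the copier/merger.
  static Throwable mergeThrowable;

  // Instead of rethrowing FSError, chain the recorded throwable as the cause
  // of the IOException, so a merge failure keeps its original stack trace.
  static void failCopyPhase(String taskId) throws IOException {
    IOException ioe = new IOException(taskId + ": the reduce copier failed");
    ioe.initCause(mergeThrowable);
    throw ioe;
  }

  // On the instanceof point: "x != null && x instanceof FSError" is equivalent
  // to "x instanceof FSError", because "null instanceof T" is always false.

  public static void main(String[] args) {
    mergeThrowable = new RuntimeException("simulated merge failure");
    try {
      failCopyPhase("task_200808291851_0001_r_000023_0");
    } catch (IOException e) {
      e.printStackTrace();  // the output ends with "Caused by: ... simulated merge failure"
    }
  }
}
{code}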

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch, 4163_v2.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.