Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 19:04:55 UTC

[jira] [Created] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Broken pipe on streaming job can lead to truncated output for a successful job
------------------------------------------------------------------------------

                 Key: MAPREDUCE-3790
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3790
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.23.1, 0.24.0
            Reporter: Jason Lowe


If a streaming job doesn't consume all of its input then the job can be marked successful even though the job's output is truncated.

Here's a simple setup that can exhibit the problem.  Note that the job output will most likely be truncated compared to the same job run with a zero-length input file.

{code}
$ hdfs dfs -cat in
foo
$ yarn jar ./share/hadoop/tools/lib/hadoop-streaming-0.24.0-SNAPSHOT.jar -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 -mapper /bin/env -reducer NONE -input in -output out
{code}

Examining the map task log shows this:

{code:title=Excerpt from map task stdout log}
2012-02-02 11:27:25,054 WARN [main] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
2012-02-02 11:27:25,054 INFO [main] org.apache.hadoop.streaming.PipeMapRed: mapRedFinished
2012-02-02 11:27:25,056 WARN [Thread-12] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Bad file descriptor
2012-02-02 11:27:25,124 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1328203555769_0001_m_000000_0 is done. And is in the process of commiting
2012-02-02 11:27:25,127 WARN [Thread-11] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: DFSOutputStream is closed
2012-02-02 11:27:25,199 INFO [main] org.apache.hadoop.mapred.Task: Task attempt_1328203555769_0001_m_000000_0 is allowed to commit now
2012-02-02 11:27:25,225 INFO [main] org.apache.hadoop.mapred.FileOutputCommitter: Saved output of task 'attempt_1328203555769_0001_m_000000_0' to hdfs://localhost:9000/user/somebody/out/_temporary/1
2012-02-02 11:27:27,834 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1328203555769_0001_m_000000_0' done.
{code}

In PipeMapRed.mapRedFinished() we can see that it eats IOExceptions and returns without waiting for the output threads or throwing a runtime exception to fail the job.  The net result is that the DFS streams can be shut down too early while the output threads are still busy, and we can lose job output.
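
To make the failure mode concrete, here's a rough sketch of that shutdown ordering.  This is not the actual PipeMapRed source and the field and method names are made up, but it shows how an IOException during the flush/close can short-circuit the wait for the output threads:

{code:title=Sketch of the problematic shutdown ordering}
import java.io.DataOutputStream;
import java.io.IOException;

class PipeShutdownSketch {
  private DataOutputStream streamToChild; // feeds the child's stdin (assumed name)
  private Thread outputThread;            // copies the child's stdout to DFS (assumed name)

  void mapRedFinished() {
    try {
      if (streamToChild != null) {
        streamToChild.flush();
        streamToChild.close();
      }
      waitOutputThreads();                 // only reached on the clean path
    } catch (IOException e) {
      System.err.println("Swallowed: " + e);
      // a broken pipe lands here, so we return before the output thread has finished writing
    }
  }

  private void waitOutputThreads() {
    try {
      if (outputThread != null) {
        outputThread.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}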

Fixing this raises the bigger question of what *should* happen when a streaming job doesn't consume all of its input.  Should we have grabbed all of the output from the job and still marked it successful, or should we have failed the job?  If the former, then we need to fix some other places in the code as well, since feeding a much larger input file (e.g. 600K) to the same sample streaming job makes the job fail with the exception below.  It wouldn't be consistent to fail a job that leaves most of its input unconsumed yet pass one that leaves only a few records behind.

{code}
2012-02-02 10:29:37,220 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1270)) - Running job: job_1328200108174_0001
2012-02-02 10:29:44,354 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1291)) - Job job_1328200108174_0001 running in uber mode : false
2012-02-02 10:29:44,355 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1298)) -  map 0% reduce 0%
2012-02-02 10:29:46,394 INFO  mapreduce.Job (Job.java:printTaskEvents(1386)) - Task Id : attempt_1328200108174_0001_m_000000_0, Status : FAILED
Error: java.io.IOException: Broken pipe
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:282)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
	at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:329)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:147)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:142)
{code}

Assuming the streaming process returns a successful exit code, I think we should allow the job to complete successfully even though it doesn't consume all of its input.  Part of the reasoning is that there's already a comment in PipeMapper.java that implies we want that behavior:

{code:title=PipeMapper.java}
        // terminate with success:
        // swallow input records although the stream processor failed/closed
{code}
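
For context, a rough sketch of the behavior that comment implies (assumed names, not the actual PipeMapper code): once a write to the child fails because the process has closed its end of the pipe, the mapper keeps accepting input records but silently drops them instead of failing the task.

{code:title=Sketch of swallowing leftover input}
import java.io.IOException;

class SwallowInputSketch {
  private boolean processClosed = false;  // set once a write to the child fails

  void map(String key, String value) {
    if (processClosed) {
      return;                             // swallow remaining input records
    }
    try {
      writeToChild(key, value);
    } catch (IOException e) {
      processClosed = true;               // broken pipe: stop writing, keep accepting input
    }
  }

  private void writeToChild(String key, String value) throws IOException {
    // placeholder for the real write to the child's stdin
  }
}
{code}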

[jira] [Updated] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Jason Lowe (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-3790:
----------------------------------

    Attachment: MAPREDUCE-3790.patch

The patch changes mapRedFinished() so that we try to wait for the output threads to complete before returning, even if there was an IOException while flushing and closing the input stream.
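
For illustration, a rough sketch of the new ordering (assumed names, not the patch itself): the flush/close failure is contained so the wait for the output threads always runs.

{code:title=Sketch of the fixed shutdown ordering}
import java.io.IOException;
import java.io.OutputStream;

class FixedShutdownSketch {
  // Even if flushing/closing the stream to the child fails, still join the thread
  // copying the child's output so buffered data reaches DFS before the task commits.
  void shutDown(OutputStream streamToChild, Thread outputThread) throws InterruptedException {
    try {
      streamToChild.flush();
      streamToChild.close();
    } catch (IOException e) {
      System.err.println("Ignoring failure closing stream to child: " + e);
    }
    // Reached on both the clean and the broken-pipe path.
    outputThread.join();
  }
}
{code}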

Added a test case to verify the functionality of stream.minRecWrittenToEnableSkip_=0 and manually verified that the /bin/env test case is fixed with this patch.
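
As a hypothetical illustration (not part of the patch), stream.minRecWrittenToEnableSkip_ is an ordinary job configuration property, so it can also be set programmatically instead of via -D on the command line as in the repro above:

{code:title=Setting the property programmatically, illustration only}
import org.apache.hadoop.mapred.JobConf;

public class SkipConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Same property the new test exercises; this mirrors stream.minRecWrittenToEnableSkip_=0.
    conf.setLong("stream.minRecWrittenToEnableSkip_", 0);
    System.out.println(conf.getLong("stream.minRecWrittenToEnableSkip_", -1));
  }
}
{code}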
                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218408#comment-13218408 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #607 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/607/])
    svn merge -c 1294750 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294751)
svn merge -c 1294743 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294747)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294751
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294747
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218412#comment-13218412 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #595 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/595/])
    svn merge -c 1294750 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294751)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294751
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Jason Lowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200198#comment-13200198 ] 

Jason Lowe commented on MAPREDUCE-3790:
---------------------------------------

test-patch.sh has issues with patches that touch hadoop-tools, so I manually ran test-patch from the root:

-1 overall.  

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 18 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version ) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.


Note that the javadoc warnings are unrelated.  I manually ran the additional test case and it passed.

                
[jira] [Updated] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Jason Lowe (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-3790:
----------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Patch Available  (was: Open)

Sorry, I should have clarified this when I posted the manual test-patch run.  The reported javadoc warnings are unrelated to the patch; they come from these projects and have nothing to do with hadoop-streaming:

* hadoop-auth
* hadoop-common
* hadoop-rumen (most are here)
* hadoop-extras
                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218404#comment-13218404 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1794 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1794/])
    MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294750)
MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294743)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294750
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294743
Files : 
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219148#comment-13219148 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #970 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/970/])
    MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294750)
MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294743)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294750
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294743
Files : 
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218403#comment-13218403 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1868 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1868/])
    MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294750)
MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294743)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294750
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294743
Files : 
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219215#comment-13219215 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1005 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1005/])
    MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294750)
MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294743)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294750
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294743
Files : 
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Jason Lowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200135#comment-13200135 ] 

Jason Lowe commented on MAPREDUCE-3790:
---------------------------------------

Upon closer investigation of the code, there is already a config option, stream.minRecWrittenToEnableSkip_, which tells streaming to skip input errors once a given number of records have been written to the output.  It can be set to 0 to ignore any error on the input, such as a broken pipe.

That still leaves the race condition in mapRedFinished() where we can close the DFSOutputStream before the output thread has finished, but there is existing support for allowing streaming jobs to leave some of their input unconsumed.
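
As a rough illustration (an untested sketch; it assumes the same sample jar and input/output paths used in the description above), the option could be passed to the streaming job as a generic -D argument:

{code}
$ yarn jar ./share/hadoop/tools/lib/hadoop-streaming-0.24.0-SNAPSHOT.jar \
    -Dstream.minRecWrittenToEnableSkip_=0 \
    -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 \
    -mapper /bin/env -reducer NONE -input in -output out
{code}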
                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219186#comment-13219186 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #211 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/211/])
    svn merge -c 1294750 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294751)
svn merge -c 1294743 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294747)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294751
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294747
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200174#comment-13200174 ] 

Hadoop QA commented on MAPREDUCE-3790:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12513202/MAPREDUCE-3790.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1771//console

This message is automatically generated.
                
[jira] [Updated] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3790:
-------------------------------------------

          Resolution: Fixed
       Fix Version/s: 0.23.2
    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Resolved  (was: Patch Available)

Thanks Jason, I just committed this to trunk, branch-0.23 and 0.23.2.
                
[jira] [Updated] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3790:
-------------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Open  (was: Patch Available)

Jason, can you please look at the javadoc warnings?
                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218441#comment-13218441 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #608 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/608/])
    svn merge -c 1294743 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294747)

     Result = ABORTED
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294747
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218401#comment-13218401 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #594 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/594/])
    svn merge -c 1294743 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294747)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294747
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                
> Broken pipe on streaming job can lead to truncated output for a successful job
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3790
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3790
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/streaming, mrv2
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3790.patch
>
>
> If a streaming job doesn't consume all of its input then the job can be marked successful even though the job's output is truncated.
> Here's a simple setup that can exhibit the problem.  Note that the job output will most likely be truncated compared to the same job run with a zero-length input file.
> {code}
> $ hdfs dfs -cat in
> foo
> $ yarn jar ./share/hadoop/tools/lib/hadoop-streaming-0.24.0-SNAPSHOT.jar -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 -mapper /bin/env -reducer NONE -input in -output out
> {code}
> Examining the map task log shows this:
> {code:title=Excerpt from map task stdout log}
> 2012-02-02 11:27:25,054 WARN [main] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
> 2012-02-02 11:27:25,054 INFO [main] org.apache.hadoop.streaming.PipeMapRed: mapRedFinished
> 2012-02-02 11:27:25,056 WARN [Thread-12] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Bad file descriptor
> 2012-02-02 11:27:25,124 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1328203555769_0001_m_000000_0 is done. And is in the process of commiting
> 2012-02-02 11:27:25,127 WARN [Thread-11] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: DFSOutputStream is closed
> 2012-02-02 11:27:25,199 INFO [main] org.apache.hadoop.mapred.Task: Task attempt_1328203555769_0001_m_000000_0 is allowed to commit now
> 2012-02-02 11:27:25,225 INFO [main] org.apache.hadoop.mapred.FileOutputCommitter: Saved output of task 'attempt_1328203555769_0001_m_000000_0' to hdfs://localhost:9000/user/somebody/out/_temporary/1
> 2012-02-02 11:27:27,834 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1328203555769_0001_m_000000_0' done.
> {code}
> In PipeMapRed.mapRedFinished() we can see it will eat IOExceptions and return without waiting for the output threads or throwing a runtime exception to fail the job.  Net result is that the DFS streams could be shutdown too early if the output threads are still busy and we could lose job output.
> Fixing this brings up the bigger question of what *should* happen when a streaming job doesn't consume all of its input.  Should we have grabbed all of the output from the job and still marked it successful or should we have failed the job?  If the former then we need to fix some other places in the code as well, since feeding a much larger input file (e.g.: 600K) to the same sample streaming job results in the job failing with the exception below.  It wouldn't be consistent to fail the job that doesn't consume a lot of input but pass the job that leaves just a few leftovers.
> {code}
> 2012-02-02 10:29:37,220 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1270)) - Running job: job_1328200108174_0001
> 2012-02-02 10:29:44,354 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1291)) - Job job_1328200108174_0001 running in uber mode : false
> 2012-02-02 10:29:44,355 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1298)) -  map 0% reduce 0%
> 2012-02-02 10:29:46,394 INFO  mapreduce.Job (Job.java:printTaskEvents(1386)) - Task Id : attempt_1328200108174_0001_m_000000_0, Status : FAILED
> Error: java.io.IOException: Broken pipe
> 	at java.io.FileOutputStream.writeBytes(Native Method)
> 	at java.io.FileOutputStream.write(FileOutputStream.java:282)
> 	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> 	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> 	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
> 	at java.io.DataOutputStream.write(DataOutputStream.java:90)
> 	at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
> 	at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
> 	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> 	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:329)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:147)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:142)
> {code}
> Assuming the streaming process returns a successful exit code, I think we should allow the job to complete successfully even though it doesn't consume all of its input.  Part of the reasoning is that there's already a comment in PipeMapper.java that implies this is the desired behavior:
> {code:title=PipeMapper.java}
>         // terminate with success:
>         // swallow input records although the stream processor failed/closed
> {code}
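> A minimal sketch of that behavior, assuming the child's exit code is checked separately (the class and field names below are made up for illustration and are not the real streaming classes):
> {code:title=Sketch of the "swallow leftover input" idea (hypothetical names)}
> import java.io.IOException;
> import java.io.OutputStream;
> 
> /** Not the real PipeMapper: just illustrates dropping records once the child has gone away. */
> public class SwallowingWriter {
>   private final OutputStream childStdin;
>   private boolean childGone = false;
> 
>   public SwallowingWriter(OutputStream childStdin) {
>     this.childStdin = childStdin;
>   }
> 
>   /** Writes a record to the child, or silently drops it once the child has stopped reading. */
>   public void write(byte[] record) {
>     if (childGone) {
>       return;                    // swallow the record; the task can still succeed
>     }
>     try {
>       childStdin.write(record);
>     } catch (IOException brokenPipe) {
>       childGone = true;          // remember the broken pipe instead of failing the task
>     }
>   }
> }
> {code}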


[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218510#comment-13218510 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #609 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/609/])
    svn merge -c 1294750 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294751)

     Result = ABORTED
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294751
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

                

[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219166#comment-13219166 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #183 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/183/])
    svn merge -c 1294750 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294751)
svn merge -c 1294743 trunk to branch-0.23 FIXES MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294747)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294751
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294747
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                

[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218443#comment-13218443 ] 

Hudson commented on MAPREDUCE-3790:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1804/])
    MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294750)
MAPREDUCE-3790 Broken pipe on streaming job can lead to truncated output for a successful job (Jason Lowe via bobby) (Revision 1294743)

     Result = ABORTED
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294750
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294743
Files : 
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming/PipeMapRed.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/OutputOnlyApp.java
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming/TestUnconsumedInput.java

                

[jira] [Updated] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Jason Lowe (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-3790:
----------------------------------

         Component/s: mrv2
    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
    

[jira] [Updated] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Jason Lowe (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-3790:
----------------------------------

            Assignee: Jason Lowe
    Target Version/s: 0.23.1, 0.24.0
              Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAPREDUCE-3790) Broken pipe on streaming job can lead to truncated output for a successful job

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218380#comment-13218380 ] 

Robert Joseph Evans commented on MAPREDUCE-3790:
------------------------------------------------

The patch looks good to me, +1.  I don't like just eating exceptions and dumping them to a log, but I don't see what else to do in this case.  The process has exited, and by exiting it indicates that it does not want to process any more data, so this approach looks OK to me.
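In other words, the rule being signed off on here is roughly the following (a paraphrase in code, not the patch itself; the class and parameter names are made up):

{code:title=Paraphrase of the approved behavior (hypothetical names)}
import java.io.IOException;

/** The child's exit status, not the broken pipe seen during cleanup, decides success. */
public final class UnconsumedInputPolicy {
  private UnconsumedInputPolicy() {}

  /**
   * @param exitCode       exit code of the streaming child process
   * @param cleanupFailure broken pipe / bad descriptor seen while shutting the pipes down, or null
   * @throws IOException if the task should be failed
   */
  public static void check(int exitCode, IOException cleanupFailure) throws IOException {
    if (exitCode != 0) {
      throw new IOException("streaming process failed with exit code " + exitCode, cleanupFailure);
    }
    if (cleanupFailure != null) {
      // logged and ignored: the child exited cleanly and simply chose not to read more input
      System.err.println("Ignoring cleanup error after successful exit: " + cleanupFailure);
    }
  }
}
{code}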
                

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira