Posted to common-user@hadoop.apache.org by Andrey Pankov <ap...@iponweb.net> on 2008/05/13 16:45:09 UTC

Streaming and subprocess error code

Hi all,

I'm looking for a way to force Streaming to shut down the whole job when 
one of its subprocesses exits with a non-zero error code.

We have the following situation: sometimes either the mapper or the reducer crashes, 
and as a rule it returns some exit code. In that case the entire streaming job 
still finishes successfully, which is wrong. Almost the same happens when a 
subprocess dies with a segmentation fault.

It's possible to check automatically whether a subprocess crashed only via 
the logs, but that means parsing tons of outputs/logs/dirs/etc.
To find the logs of your job you have to know its jobid (something like 
job_200805130853_0016), and I don't know an easy way to determine it other 
than scanning stdout for the pattern. Then you have to find the logs of each 
mapper and each reducer, work out how to parse them, etc, etc...

So, is there an easier way to get the correct status of the whole streaming 
job, or do I still have to build a rather fragile parsing system for this 
purpose?

Thanks in advance.

-- 
Andrey Pankov


Re: Reduce not completing

Posted by Rick Hangartner <ha...@strands.com>.
Hi Amareshwari,

The "stream.non.zero.exit.status.is failure" is the default (which the  
docs indicate is 'True').

We don't think the problem is the reducer script per se: Under one  
circumstance we are investigating further it arises when the reducer  
script does nothing but copy stdin to stdout, and when we include the  
directive "-jobconf mapred.reduce.tasks=0" in the command to the  
streaming facility.
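
For reference, here is a minimal Python sketch of such a pass-through 
reducer (the file name is just a placeholder, not our actual script):

#!/usr/bin/env python
# identity_reducer.py -- hypothetical sketch of a pass-through reducer
import sys

for line in sys.stdin:
    sys.stdout.write(line)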

Our streaming mapper and reducer scripts are in Python and we don't  
quite understand the directions in the streaming docs:

"To set a status,reporter:status:<message> should be sent to stderr"

Could you elaborate a little further?  In this case, the Python  
scripts are not throwing any exceptions or otherwise indicating an  
error, so we also don't have any internal error messages to send to  
stderr.

The other problem is that when we run the map-reduce with the same  
input multiple times it doesn't fail every time.  That's why our first  
suspicion was that it has something to do with timing between  
components of the M-R system because of the modest performance of our  
cluster.

Thanks,
RDH


On Dec 23, 2008, at 1:00 AM, Amareshwari Sriramadasu wrote:

> You can report status from streaming job by emitting  
> reporter:status:<message>  in stderr.
> See documentation @ http://hadoop.apache.org/core/docs/r0.18.2/streaming.html#How+do+I+update+status+in+streaming+applications%3F
>
> But from the exception trace, it doesn't look like lack of  
> report(timeout). The trace tells that reducer jvm process exited  
> with exit-code 1.
> It's mostly a bug in reducer code. What is the configuration value  
> of the property "stream.non.zero.exit.status.is.failure" ?
>
> Thanks
> Amareshwari
> Rick Hangartner wrote:
>> Hi,
>>
>> We seem to be seeing an runtime exception in the Reduce phase of a  
>> streaming Map-Reduce that has been mentioned before on this list.
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/%3C482BF75C.9030207@iponweb.net%3E
>>
>> When I Google the exception, the only thing returned is to this one  
>> short thread on the mailing list.  Unfortunately, we don't quite  
>> understand the exception message in our current situation or the  
>> eventual explanation and resolution of that previous case.
>>
>> We have tested that the Python script run in the Reduce phase runs  
>> without problems.  It returns the correct results when run from the  
>> command line fed from stdin by a file that is the output of the map  
>> phase for a small map-reduce job that fails in this way.
>>
>> Here's the exception we are seeing from the jobtracker log:
>>
>> 2008-12-22 18:13:36,415 INFO  
>> org.apache.hadoop.mapred.JobInProgress: Task  
>> 'attempt_200812221742_0004_m_000009_0' has completed  
>> task_200812221742_0004_m_000009 successfully.
>> 2008-12-22 18:13:50,607 INFO  
>> org.apache.hadoop.mapred.TaskInProgress: Error from  
>> attempt_200812221742_0004_r_000000_0: java.lang.RuntimeException:  
>> PipeMapRed.waitOutputThreads(): subprocess failed with code 1
>>    at  
>> org 
>> .apache 
>> .hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301)
>>    at  
>> org 
>> .apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java: 
>> 518)
>>    at  
>> org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
>>    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>>    at org.apache.hadoop.mapred.TaskTracker 
>> $Child.main(TaskTracker.java:2207)
>>
>> 2008-12-22 18:13:52,045 INFO org.apache.hadoop.mapred.JobTracker:  
>> Removed completed task 'attempt_200812221742_0004_r_000000_0' from  
>> 'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
>> 2008-12-22 18:13:52,175 INFO org.apache.hadoop.mapred.JobTracker:  
>> Adding task 'attempt_200812221742_0004_r_000000_1' to tip  
>> task_200812221742_0004_r_000000, for tracker  
>> 'tracker_hnode5.cor.mystrands.in:localhost/127.0.0.1:55254'
>>
>> We typically see 4 repetitions of this exception in the log. And we  
>> may see 1-2 sets of those repetitions.
>>
>> If someone could explain what this exception actually means, and  
>> perhaps what we might need to change in our configuration to fix  
>> it, we would be most appreciative.   Naively, it almost seems if a  
>> task is just taking slightly too long to complete and report that  
>> fact, perhaps because of other Hadoop or MR processes going on at  
>> the same time.  If we re-run this map-reduce, it does sometimes run  
>> to a successful completion without an exception.
>>
>> We are just testing map-reduce as a candidate for a number of data  
>> reduction tasks right now.  We are running Hadoop 18.1 on a cluster  
>> of 9 retired desktop machines that just have 100Mb networking and  
>> about 2GB of RAM each, so that's why we are suspecting this could  
>> just be a problem that tasks are taking slightly too long to report  
>> back they have completed, rather than an actual bug.   (We will be  
>> upgrading this test  cluster to Hadoop 19.x and 1Gb networking very  
>> shortly.)
>>
>> Thanks,
>> RDH
>>
>> Begin forwarded message:
>>
>>> From: "Rick Cox" <ri...@gmail.com>
>>> Date: May 14, 2008 9:01:31 AM PDT
>>> To: core-user@hadoop.apache.org, apankov@iponweb.net
>>> Subject: Re: Streaming and subprocess error code
>>> Reply-To: core-user@hadoop.apache.org
>>>
>>> Does the syslog output from a should-have-failed task contain
>>> something like this?
>>>
>>>   java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
>>> subprocess failed with code 1
>>>
>>> (In particular, I'm curious if it mentions the RuntimeException.)
>>>
>>> Tasks that consume all their input and then exit non-zero are
>>> definitely supposed to be counted as failed, so there's either a
>>> problem with the setup or a bug somewhere.
>>>
>>> rick
>>>
>>> On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov  
>>> <ap...@iponweb.net> wrote:
>>>> Hi,
>>>>
>>>> I've tested this new option "-jobconf
>>>> stream.non.zero.exit.status.is.failure=true". Seems working but  
>>>> still not
>>>> good for me. When mapper/reducer program have read all input data
>>>> successfully and fails after that, streaming still finishes  
>>>> successfully so
>>>> there are no chances to know about some data post-processing  
>>>> errors in
>>>> subprocesses :(
>>>>
>>>>
>>>>
>>>> Andrey Pankov wrote:
>>>>
>>>>>
>>
>


Re: Reduce not completing

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
I couldn't tell much from the logs about why that is so.

For reporting status, you can write to stderr from your script. The 
format should be reporter:status:<message>. If the program emits such lines 
on stderr, the framework will treat them as status reports. Hope this 
clarifies things.
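
As a minimal sketch (the script name and the message text are placeholders, 
not from this thread), a Python streaming script could report status like this:

#!/usr/bin/env python
# status_reporter.py -- hypothetical sketch: report status from a streaming script
import sys

count = 0
for line in sys.stdin:
    count += 1
    if count % 10000 == 0:
        # Lines of the form reporter:status:<message> written to stderr are
        # treated by the streaming framework as status reports, not errors.
        sys.stderr.write("reporter:status:processed %d records\n" % count)
    sys.stdout.write(line)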

Thanks
Amareshwari
Rick Hangartner wrote:
> Hi Amareshwari,
>
> This may or may not be helpful.
>
> Here's an example of three runs in rapid succession.  The first and 
> last completed without any problems.  The middle one completed in this 
> case, but the log has three exceptions of the type we reported (We 
> think a fourth exception would have terminated the run based on what 
> we've seen in the past.)
>
> Here are excerpts from the progress reports for the three runs that we 
> see on the remote machine from which we are submitting the streaming job:
>
> 08/12/23 19:02:46 INFO streaming.StreamJob: Running job: 
> job_200812221742_0074
> 08/12/23 19:02:46 INFO streaming.StreamJob: To kill this job, run:
> 08/12/23 19:02:46 INFO streaming.StreamJob: 
> /Users/hadoop/hadoop/bin/../bin/hadoop job  
> -Dmapred.job.tracker=hnode9.cor.mystrands.in:9001 -kill 
> job_200812221742_0074
> 08/12/23 19:02:46 INFO streaming.StreamJob: Tracking URL: 
> http://hnode9.cor.mystrands.in:50030/jobdetails.jsp?jobid=job_200812221742_0074 
>
> 08/12/23 19:02:47 INFO streaming.StreamJob:  map 0%  reduce 0%
> 08/12/23 19:02:50 INFO streaming.StreamJob:  map 22%  reduce 0%
> 08/12/23 19:02:51 INFO streaming.StreamJob:  map 56%  reduce 0%
> 08/12/23 19:02:54 INFO streaming.StreamJob:  map 67%  reduce 0%
> 08/12/23 19:02:55 INFO streaming.StreamJob:  map 78%  reduce 0%
> 08/12/23 19:02:57 INFO streaming.StreamJob:  map 87%  reduce 0%
> 08/12/23 19:02:59 INFO streaming.StreamJob:  map 90%  reduce 0%
> 08/12/23 19:03:01 INFO streaming.StreamJob:  map 92%  reduce 0%
> 08/12/23 19:03:02 INFO streaming.StreamJob:  map 96%  reduce 22%
> 08/12/23 19:03:07 INFO streaming.StreamJob:  map 100%  reduce 26%
> 08/12/23 19:03:18 INFO streaming.StreamJob:  map 100%  reduce 73%
> 08/12/23 19:03:23 INFO streaming.StreamJob:  map 100%  reduce 91%
> 08/12/23 19:03:25 INFO streaming.StreamJob:  map 100%  reduce 100%
> 08/12/23 19:03:25 INFO streaming.StreamJob: Job complete: 
> job_200812221742_0074
>
> 08/12/23 19:04:10 INFO streaming.StreamJob: Running job: 
> job_200812221742_0075
> 08/12/23 19:04:10 INFO streaming.StreamJob: To kill this job, run:
> 08/12/23 19:04:10 INFO streaming.StreamJob: 
> /Users/hadoop/hadoop/bin/../bin/hadoop job  
> -Dmapred.job.tracker=hnode9.cor.mystrands.in:9001 -kill 
> job_200812221742_0075
> 08/12/23 19:04:10 INFO streaming.StreamJob: Tracking URL: 
> http://hnode9.cor.mystrands.in:50030/jobdetails.jsp?jobid=job_200812221742_0075 
>
> 08/12/23 19:04:11 INFO streaming.StreamJob:  map 0%  reduce 0%
> 08/12/23 19:04:14 INFO streaming.StreamJob:  map 33%  reduce 0%
> 08/12/23 19:04:15 INFO streaming.StreamJob:  map 56%  reduce 0%
> 08/12/23 19:04:16 INFO streaming.StreamJob:  map 67%  reduce 0%
> 08/12/23 19:04:17 INFO streaming.StreamJob:  map 71%  reduce 0%
> 08/12/23 19:04:20 INFO streaming.StreamJob:  map 87%  reduce 0%
> 08/12/23 19:04:23 INFO streaming.StreamJob:  map 89%  reduce 0%
> 08/12/23 19:04:24 INFO streaming.StreamJob:  map 95%  reduce 0%
> 08/12/23 19:04:26 INFO streaming.StreamJob:  map 98%  reduce 0%
> 08/12/23 19:04:28 INFO streaming.StreamJob:  map 100%  reduce 0%
> 08/12/23 19:04:30 INFO streaming.StreamJob:  map 100%  reduce 30%
> 08/12/23 19:04:41 INFO streaming.StreamJob:  map 100%  reduce 0%
> 08/12/23 19:05:08 INFO streaming.StreamJob:  map 100%  reduce 84%
> 08/12/23 19:05:13 INFO streaming.StreamJob:  map 100%  reduce 100%
> 08/12/23 19:05:14 INFO streaming.StreamJob: Job complete: 
> job_200812221742_0075
>
> 08/12/23 19:08:24 INFO streaming.StreamJob: Running job: 
> job_200812221742_0076
> 08/12/23 19:08:24 INFO streaming.StreamJob: To kill this job, run:
> 08/12/23 19:08:24 INFO streaming.StreamJob: 
> /Users/hadoop/hadoop/bin/../bin/hadoop job  
> -Dmapred.job.tracker=hnode9.cor.mystrands.in:9001 -kill 
> job_200812221742_0076
> 08/12/23 19:08:24 INFO streaming.StreamJob: Tracking URL: 
> http://hnode9.cor.mystrands.in:50030/jobdetails.jsp?jobid=job_200812221742_0076 
>
> 08/12/23 19:08:26 INFO streaming.StreamJob:  map 0%  reduce 0%
> 08/12/23 19:08:29 INFO streaming.StreamJob:  map 11%  reduce 0%
> 08/12/23 19:08:31 INFO streaming.StreamJob:  map 44%  reduce 0%
> 08/12/23 19:08:32 INFO streaming.StreamJob:  map 56%  reduce 0%
> 08/12/23 19:08:33 INFO streaming.StreamJob:  map 78%  reduce 0%
> 08/12/23 19:08:35 INFO streaming.StreamJob:  map 81%  reduce 0%
> 08/12/23 19:08:36 INFO streaming.StreamJob:  map 91%  reduce 0%
> 08/12/23 19:08:37 INFO streaming.StreamJob:  map 92%  reduce 0%
> 08/12/23 19:08:40 INFO streaming.StreamJob:  map 96%  reduce 19%
> 08/12/23 19:08:45 INFO streaming.StreamJob:  map 100%  reduce 19%
> 08/12/23 19:08:50 INFO streaming.StreamJob:  map 100%  reduce 30%
> 08/12/23 19:09:01 INFO streaming.StreamJob:  map 100%  reduce 85%
> 08/12/23 19:09:05 INFO streaming.StreamJob:  map 100%  reduce 100%
> 08/12/23 19:09:05 INFO streaming.StreamJob: Job complete: 
> job_200812221742_0076
>
> For the middle run, the job tracker long has the following exceptions 
> reported, but for the first and last the log does not list any 
> exceptions:
>
> 2008-12-23 19:04:27,816 INFO org.apache.hadoop.mapred.JobInProgress: 
> Task 'attempt_200812221742_0075_m_000000_0' has completed 
> task_200812221742_0075_m_000000 successfully.
> 2008-12-23 19:04:36,505 INFO org.apache.hadoop.mapred.TaskInProgress: 
> Error from attempt_200812221742_0075_r_000000_0: 
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> failed with code 1
>     at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 
>
>     at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 
>
>     at 
> org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>     at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>
> 2008-12-23 19:04:39,050 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'attempt_200812221742_0075_r_000000_0' from 
> 'tracker_hnode2.cor.mystrands.in:localhost/127.0.0.1:36968'
> 2008-12-23 19:04:39,214 INFO org.apache.hadoop.mapred.JobTracker: 
> Adding task 'attempt_200812221742_0075_r_000000_1' to tip 
> task_200812221742_0075_r_000000, for tracker 
> 'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
> 2008-12-23 19:04:44,237 INFO org.apache.hadoop.mapred.TaskInProgress: 
> Error from attempt_200812221742_0075_r_000000_1: 
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> failed with code 1
>     at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 
>
>     at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 
>
>     at 
> org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>     at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>
> 2008-12-23 19:04:47,845 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'attempt_200812221742_0075_r_000000_1' from 
> 'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
> 2008-12-23 19:04:47,856 INFO org.apache.hadoop.mapred.JobTracker: 
> Adding task 'attempt_200812221742_0075_r_000000_2' to tip 
> task_200812221742_0075_r_000000, for tracker 
> 'tracker_hnode1.cor.mystrands.in:localhost/127.0.0.1:37971'
> 2008-12-23 19:04:57,781 INFO org.apache.hadoop.mapred.TaskInProgress: 
> Error from attempt_200812221742_0075_r_000000_2: 
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> failed with code 1
>     at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 
>
>     at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 
>
>     at 
> org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>     at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>
> 2008-12-23 19:04:57,781 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'attempt_200812221742_0075_r_000000_2' from 
> 'tracker_hnode1.cor.mystrands.in:localhost/127.0.0.1:37971'
>
> Thanks,
> RDH
>
> On Dec 23, 2008, at 1:00 AM, Amareshwari Sriramadasu wrote:
>
>> You can report status from streaming job by emitting 
>> reporter:status:<message>  in stderr.
>> See documentation @ 
>> http://hadoop.apache.org/core/docs/r0.18.2/streaming.html#How+do+I+update+status+in+streaming+applications%3F 
>>
>>
>> But from the exception trace, it doesn't look like lack of 
>> report(timeout). The trace tells that reducer jvm process exited with 
>> exit-code 1.
>> It's mostly a bug in reducer code. What is the configuration value of 
>> the property "stream.non.zero.exit.status.is.failure" ?
>>
>> Thanks
>> Amareshwari
>> Rick Hangartner wrote:
>>> Hi,
>>>
>>> We seem to be seeing an runtime exception in the Reduce phase of a 
>>> streaming Map-Reduce that has been mentioned before on this list.
>>>
>>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/%3C482BF75C.9030207@iponweb.net%3E 
>>>
>>>
>>> When I Google the exception, the only thing returned is to this one 
>>> short thread on the mailing list.  Unfortunately, we don't quite 
>>> understand the exception message in our current situation or the 
>>> eventual explanation and resolution of that previous case.
>>>
>>> We have tested that the Python script run in the Reduce phase runs 
>>> without problems.  It returns the correct results when run from the 
>>> command line fed from stdin by a file that is the output of the map 
>>> phase for a small map-reduce job that fails in this way.
>>>
>>> Here's the exception we are seeing from the jobtracker log:
>>>
>>> 2008-12-22 18:13:36,415 INFO org.apache.hadoop.mapred.JobInProgress: 
>>> Task 'attempt_200812221742_0004_m_000009_0' has completed 
>>> task_200812221742_0004_m_000009 successfully.
>>> 2008-12-22 18:13:50,607 INFO 
>>> org.apache.hadoop.mapred.TaskInProgress: Error from 
>>> attempt_200812221742_0004_r_000000_0: java.lang.RuntimeException: 
>>> PipeMapRed.waitOutputThreads(): subprocess failed with code 1
>>>    at 
>>> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 
>>>
>>>    at 
>>> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 
>>>
>>>    at 
>>> org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
>>>    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>>>    at 
>>> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>>>
>>> 2008-12-22 18:13:52,045 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Removed completed task 'attempt_200812221742_0004_r_000000_0' from 
>>> 'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
>>> 2008-12-22 18:13:52,175 INFO org.apache.hadoop.mapred.JobTracker: 
>>> Adding task 'attempt_200812221742_0004_r_000000_1' to tip 
>>> task_200812221742_0004_r_000000, for tracker 
>>> 'tracker_hnode5.cor.mystrands.in:localhost/127.0.0.1:55254'
>>>
>>> We typically see 4 repetitions of this exception in the log. And we 
>>> may see 1-2 sets of those repetitions.
>>>
>>> If someone could explain what this exception actually means, and 
>>> perhaps what we might need to change in our configuration to fix it, 
>>> we would be most appreciative.   Naively, it almost seems if a task 
>>> is just taking slightly too long to complete and report that fact, 
>>> perhaps because of other Hadoop or MR processes going on at the same 
>>> time.  If we re-run this map-reduce, it does sometimes run to a 
>>> successful completion without an exception.
>>>
>>> We are just testing map-reduce as a candidate for a number of data 
>>> reduction tasks right now.  We are running Hadoop 18.1 on a cluster 
>>> of 9 retired desktop machines that just have 100Mb networking and 
>>> about 2GB of RAM each, so that's why we are suspecting this could 
>>> just be a problem that tasks are taking slightly too long to report 
>>> back they have completed, rather than an actual bug.   (We will be 
>>> upgrading this test  cluster to Hadoop 19.x and 1Gb networking very 
>>> shortly.)
>>>
>>> Thanks,
>>> RDH
>>>
>>> Begin forwarded message:
>>>
>>>> From: "Rick Cox" <ri...@gmail.com>
>>>> Date: May 14, 2008 9:01:31 AM PDT
>>>> To: core-user@hadoop.apache.org, apankov@iponweb.net
>>>> Subject: Re: Streaming and subprocess error code
>>>> Reply-To: core-user@hadoop.apache.org
>>>>
>>>> Does the syslog output from a should-have-failed task contain
>>>> something like this?
>>>>
>>>>   java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
>>>> subprocess failed with code 1
>>>>
>>>> (In particular, I'm curious if it mentions the RuntimeException.)
>>>>
>>>> Tasks that consume all their input and then exit non-zero are
>>>> definitely supposed to be counted as failed, so there's either a
>>>> problem with the setup or a bug somewhere.
>>>>
>>>> rick
>>>>
>>>> On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov 
>>>> <ap...@iponweb.net> wrote:
>>>>> Hi,
>>>>>
>>>>> I've tested this new option "-jobconf
>>>>> stream.non.zero.exit.status.is.failure=true". Seems working but 
>>>>> still not
>>>>> good for me. When mapper/reducer program have read all input data
>>>>> successfully and fails after that, streaming still finishes 
>>>>> successfully so
>>>>> there are no chances to know about some data post-processing 
>>>>> errors in
>>>>> subprocesses :(
>>>>>
>>>>>
>>>>>
>>>>> Andrey Pankov wrote:
>>>>>
>>>>>>
>>>
>>
>
>


Re: Reduce not completing

Posted by Rick Hangartner <ha...@strands.com>.
Hi Amareshwari,

This may or may not be helpful.

Here's an example of three runs in rapid succession.  The first and  
last completed without any problems.  The middle one completed in this  
case, but the log has three exceptions of the type we reported (We  
think a fourth exception would have terminated the run based on what  
we've seen in the past.)

Here are excerpts from the progress reports for the three runs that we  
see on the remote machine from which we are submitting the streaming  
job:

08/12/23 19:02:46 INFO streaming.StreamJob: Running job:  
job_200812221742_0074
08/12/23 19:02:46 INFO streaming.StreamJob: To kill this job, run:
08/12/23 19:02:46 INFO streaming.StreamJob: /Users/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hnode9.cor.mystrands.in:9001 -kill job_200812221742_0074
08/12/23 19:02:46 INFO streaming.StreamJob: Tracking URL: http://hnode9.cor.mystrands.in:50030/jobdetails.jsp?jobid=job_200812221742_0074
08/12/23 19:02:47 INFO streaming.StreamJob:  map 0%  reduce 0%
08/12/23 19:02:50 INFO streaming.StreamJob:  map 22%  reduce 0%
08/12/23 19:02:51 INFO streaming.StreamJob:  map 56%  reduce 0%
08/12/23 19:02:54 INFO streaming.StreamJob:  map 67%  reduce 0%
08/12/23 19:02:55 INFO streaming.StreamJob:  map 78%  reduce 0%
08/12/23 19:02:57 INFO streaming.StreamJob:  map 87%  reduce 0%
08/12/23 19:02:59 INFO streaming.StreamJob:  map 90%  reduce 0%
08/12/23 19:03:01 INFO streaming.StreamJob:  map 92%  reduce 0%
08/12/23 19:03:02 INFO streaming.StreamJob:  map 96%  reduce 22%
08/12/23 19:03:07 INFO streaming.StreamJob:  map 100%  reduce 26%
08/12/23 19:03:18 INFO streaming.StreamJob:  map 100%  reduce 73%
08/12/23 19:03:23 INFO streaming.StreamJob:  map 100%  reduce 91%
08/12/23 19:03:25 INFO streaming.StreamJob:  map 100%  reduce 100%
08/12/23 19:03:25 INFO streaming.StreamJob: Job complete:  
job_200812221742_0074

08/12/23 19:04:10 INFO streaming.StreamJob: Running job:  
job_200812221742_0075
08/12/23 19:04:10 INFO streaming.StreamJob: To kill this job, run:
08/12/23 19:04:10 INFO streaming.StreamJob: /Users/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hnode9.cor.mystrands.in:9001 -kill job_200812221742_0075
08/12/23 19:04:10 INFO streaming.StreamJob: Tracking URL: http://hnode9.cor.mystrands.in:50030/jobdetails.jsp?jobid=job_200812221742_0075
08/12/23 19:04:11 INFO streaming.StreamJob:  map 0%  reduce 0%
08/12/23 19:04:14 INFO streaming.StreamJob:  map 33%  reduce 0%
08/12/23 19:04:15 INFO streaming.StreamJob:  map 56%  reduce 0%
08/12/23 19:04:16 INFO streaming.StreamJob:  map 67%  reduce 0%
08/12/23 19:04:17 INFO streaming.StreamJob:  map 71%  reduce 0%
08/12/23 19:04:20 INFO streaming.StreamJob:  map 87%  reduce 0%
08/12/23 19:04:23 INFO streaming.StreamJob:  map 89%  reduce 0%
08/12/23 19:04:24 INFO streaming.StreamJob:  map 95%  reduce 0%
08/12/23 19:04:26 INFO streaming.StreamJob:  map 98%  reduce 0%
08/12/23 19:04:28 INFO streaming.StreamJob:  map 100%  reduce 0%
08/12/23 19:04:30 INFO streaming.StreamJob:  map 100%  reduce 30%
08/12/23 19:04:41 INFO streaming.StreamJob:  map 100%  reduce 0%
08/12/23 19:05:08 INFO streaming.StreamJob:  map 100%  reduce 84%
08/12/23 19:05:13 INFO streaming.StreamJob:  map 100%  reduce 100%
08/12/23 19:05:14 INFO streaming.StreamJob: Job complete:  
job_200812221742_0075

08/12/23 19:08:24 INFO streaming.StreamJob: Running job:  
job_200812221742_0076
08/12/23 19:08:24 INFO streaming.StreamJob: To kill this job, run:
08/12/23 19:08:24 INFO streaming.StreamJob: /Users/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hnode9.cor.mystrands.in:9001 -kill job_200812221742_0076
08/12/23 19:08:24 INFO streaming.StreamJob: Tracking URL: http://hnode9.cor.mystrands.in:50030/jobdetails.jsp?jobid=job_200812221742_0076
08/12/23 19:08:26 INFO streaming.StreamJob:  map 0%  reduce 0%
08/12/23 19:08:29 INFO streaming.StreamJob:  map 11%  reduce 0%
08/12/23 19:08:31 INFO streaming.StreamJob:  map 44%  reduce 0%
08/12/23 19:08:32 INFO streaming.StreamJob:  map 56%  reduce 0%
08/12/23 19:08:33 INFO streaming.StreamJob:  map 78%  reduce 0%
08/12/23 19:08:35 INFO streaming.StreamJob:  map 81%  reduce 0%
08/12/23 19:08:36 INFO streaming.StreamJob:  map 91%  reduce 0%
08/12/23 19:08:37 INFO streaming.StreamJob:  map 92%  reduce 0%
08/12/23 19:08:40 INFO streaming.StreamJob:  map 96%  reduce 19%
08/12/23 19:08:45 INFO streaming.StreamJob:  map 100%  reduce 19%
08/12/23 19:08:50 INFO streaming.StreamJob:  map 100%  reduce 30%
08/12/23 19:09:01 INFO streaming.StreamJob:  map 100%  reduce 85%
08/12/23 19:09:05 INFO streaming.StreamJob:  map 100%  reduce 100%
08/12/23 19:09:05 INFO streaming.StreamJob: Job complete:  
job_200812221742_0076

For the middle run, the job tracker log has the following exceptions  
reported, but for the first and last the log does not list any  
exceptions:

2008-12-23 19:04:27,816 INFO org.apache.hadoop.mapred.JobInProgress:  
Task 'attempt_200812221742_0075_m_000000_0' has completed  
task_200812221742_0075_m_000000 successfully.
2008-12-23 19:04:36,505 INFO org.apache.hadoop.mapred.TaskInProgress:  
Error from attempt_200812221742_0075_r_000000_0:  
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess  
failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518)
    at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

2008-12-23 19:04:39,050 INFO org.apache.hadoop.mapred.JobTracker:  
Removed completed task 'attempt_200812221742_0075_r_000000_0' from  
'tracker_hnode2.cor.mystrands.in:localhost/127.0.0.1:36968'
2008-12-23 19:04:39,214 INFO org.apache.hadoop.mapred.JobTracker:  
Adding task 'attempt_200812221742_0075_r_000000_1' to tip  
task_200812221742_0075_r_000000, for tracker  
'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
2008-12-23 19:04:44,237 INFO org.apache.hadoop.mapred.TaskInProgress:  
Error from attempt_200812221742_0075_r_000000_1:  
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess  
failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518)
    at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

2008-12-23 19:04:47,845 INFO org.apache.hadoop.mapred.JobTracker:  
Removed completed task 'attempt_200812221742_0075_r_000000_1' from  
'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
2008-12-23 19:04:47,856 INFO org.apache.hadoop.mapred.JobTracker:  
Adding task 'attempt_200812221742_0075_r_000000_2' to tip  
task_200812221742_0075_r_000000, for tracker  
'tracker_hnode1.cor.mystrands.in:localhost/127.0.0.1:37971'
2008-12-23 19:04:57,781 INFO org.apache.hadoop.mapred.TaskInProgress:  
Error from attempt_200812221742_0075_r_000000_2:  
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess  
failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518)
    at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

2008-12-23 19:04:57,781 INFO org.apache.hadoop.mapred.JobTracker:  
Removed completed task 'attempt_200812221742_0075_r_000000_2' from  
'tracker_hnode1.cor.mystrands.in:localhost/127.0.0.1:37971'

Thanks,
RDH

On Dec 23, 2008, at 1:00 AM, Amareshwari Sriramadasu wrote:

> You can report status from streaming job by emitting  
> reporter:status:<message>  in stderr.
> See documentation @ http://hadoop.apache.org/core/docs/r0.18.2/streaming.html#How+do+I+update+status+in+streaming+applications%3F
>
> But from the exception trace, it doesn't look like lack of  
> report(timeout). The trace tells that reducer jvm process exited  
> with exit-code 1.
> It's mostly a bug in reducer code. What is the configuration value  
> of the property "stream.non.zero.exit.status.is.failure" ?
>
> Thanks
> Amareshwari
> Rick Hangartner wrote:
>> Hi,
>>
>> We seem to be seeing an runtime exception in the Reduce phase of a  
>> streaming Map-Reduce that has been mentioned before on this list.
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/%3C482BF75C.9030207@iponweb.net%3E
>>
>> When I Google the exception, the only thing returned is to this one  
>> short thread on the mailing list.  Unfortunately, we don't quite  
>> understand the exception message in our current situation or the  
>> eventual explanation and resolution of that previous case.
>>
>> We have tested that the Python script run in the Reduce phase runs  
>> without problems.  It returns the correct results when run from the  
>> command line fed from stdin by a file that is the output of the map  
>> phase for a small map-reduce job that fails in this way.
>>
>> Here's the exception we are seeing from the jobtracker log:
>>
>> 2008-12-22 18:13:36,415 INFO  
>> org.apache.hadoop.mapred.JobInProgress: Task  
>> 'attempt_200812221742_0004_m_000009_0' has completed  
>> task_200812221742_0004_m_000009 successfully.
>> 2008-12-22 18:13:50,607 INFO  
>> org.apache.hadoop.mapred.TaskInProgress: Error from  
>> attempt_200812221742_0004_r_000000_0: java.lang.RuntimeException:  
>> PipeMapRed.waitOutputThreads(): subprocess failed with code 1
>>    at  
>> org 
>> .apache 
>> .hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301)
>>    at  
>> org 
>> .apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java: 
>> 518)
>>    at  
>> org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
>>    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>>    at org.apache.hadoop.mapred.TaskTracker 
>> $Child.main(TaskTracker.java:2207)
>>
>> 2008-12-22 18:13:52,045 INFO org.apache.hadoop.mapred.JobTracker:  
>> Removed completed task 'attempt_200812221742_0004_r_000000_0' from  
>> 'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
>> 2008-12-22 18:13:52,175 INFO org.apache.hadoop.mapred.JobTracker:  
>> Adding task 'attempt_200812221742_0004_r_000000_1' to tip  
>> task_200812221742_0004_r_000000, for tracker  
>> 'tracker_hnode5.cor.mystrands.in:localhost/127.0.0.1:55254'
>>
>> We typically see 4 repetitions of this exception in the log. And we  
>> may see 1-2 sets of those repetitions.
>>
>> If someone could explain what this exception actually means, and  
>> perhaps what we might need to change in our configuration to fix  
>> it, we would be most appreciative.   Naively, it almost seems if a  
>> task is just taking slightly too long to complete and report that  
>> fact, perhaps because of other Hadoop or MR processes going on at  
>> the same time.  If we re-run this map-reduce, it does sometimes run  
>> to a successful completion without an exception.
>>
>> We are just testing map-reduce as a candidate for a number of data  
>> reduction tasks right now.  We are running Hadoop 18.1 on a cluster  
>> of 9 retired desktop machines that just have 100Mb networking and  
>> about 2GB of RAM each, so that's why we are suspecting this could  
>> just be a problem that tasks are taking slightly too long to report  
>> back they have completed, rather than an actual bug.   (We will be  
>> upgrading this test  cluster to Hadoop 19.x and 1Gb networking very  
>> shortly.)
>>
>> Thanks,
>> RDH
>>
>> Begin forwarded message:
>>
>>> From: "Rick Cox" <ri...@gmail.com>
>>> Date: May 14, 2008 9:01:31 AM PDT
>>> To: core-user@hadoop.apache.org, apankov@iponweb.net
>>> Subject: Re: Streaming and subprocess error code
>>> Reply-To: core-user@hadoop.apache.org
>>>
>>> Does the syslog output from a should-have-failed task contain
>>> something like this?
>>>
>>>   java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
>>> subprocess failed with code 1
>>>
>>> (In particular, I'm curious if it mentions the RuntimeException.)
>>>
>>> Tasks that consume all their input and then exit non-zero are
>>> definitely supposed to be counted as failed, so there's either a
>>> problem with the setup or a bug somewhere.
>>>
>>> rick
>>>
>>> On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov  
>>> <ap...@iponweb.net> wrote:
>>>> Hi,
>>>>
>>>> I've tested this new option "-jobconf
>>>> stream.non.zero.exit.status.is.failure=true". Seems working but  
>>>> still not
>>>> good for me. When mapper/reducer program have read all input data
>>>> successfully and fails after that, streaming still finishes  
>>>> successfully so
>>>> there are no chances to know about some data post-processing  
>>>> errors in
>>>> subprocesses :(
>>>>
>>>>
>>>>
>>>> Andrey Pankov wrote:
>>>>
>>>>>
>>
>


Re: Reduce not completing

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
You can report status from a streaming job by emitting 
reporter:status:<message> on stderr.
See documentation @ 
http://hadoop.apache.org/core/docs/r0.18.2/streaming.html#How+do+I+update+status+in+streaming+applications%3F

But from the exception trace, it doesn't look like a lack of 
reporting (timeout). The trace says that the reducer JVM process exited with 
exit code 1.
It's most likely a bug in the reducer code. What is the configuration value of 
the property "stream.non.zero.exit.status.is.failure"?

Thanks
Amareshwari
Rick Hangartner wrote:
> Hi,
>
> We seem to be seeing an runtime exception in the Reduce phase of a 
> streaming Map-Reduce that has been mentioned before on this list.
>
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/%3C482BF75C.9030207@iponweb.net%3E 
>
>
> When I Google the exception, the only thing returned is to this one 
> short thread on the mailing list.  Unfortunately, we don't quite 
> understand the exception message in our current situation or the 
> eventual explanation and resolution of that previous case.
>
> We have tested that the Python script run in the Reduce phase runs 
> without problems.  It returns the correct results when run from the 
> command line fed from stdin by a file that is the output of the map 
> phase for a small map-reduce job that fails in this way.
>
> Here's the exception we are seeing from the jobtracker log:
>
> 2008-12-22 18:13:36,415 INFO org.apache.hadoop.mapred.JobInProgress: 
> Task 'attempt_200812221742_0004_m_000009_0' has completed 
> task_200812221742_0004_m_000009 successfully.
> 2008-12-22 18:13:50,607 INFO org.apache.hadoop.mapred.TaskInProgress: 
> Error from attempt_200812221742_0004_r_000000_0: 
> java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> failed with code 1
>     at 
> org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 
>
>     at 
> org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 
>
>     at 
> org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>     at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>
> 2008-12-22 18:13:52,045 INFO org.apache.hadoop.mapred.JobTracker: 
> Removed completed task 'attempt_200812221742_0004_r_000000_0' from 
> 'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
> 2008-12-22 18:13:52,175 INFO org.apache.hadoop.mapred.JobTracker: 
> Adding task 'attempt_200812221742_0004_r_000000_1' to tip 
> task_200812221742_0004_r_000000, for tracker 
> 'tracker_hnode5.cor.mystrands.in:localhost/127.0.0.1:55254'
>
> We typically see 4 repetitions of this exception in the log. And we 
> may see 1-2 sets of those repetitions.
>
> If someone could explain what this exception actually means, and 
> perhaps what we might need to change in our configuration to fix it, 
> we would be most appreciative.   Naively, it almost seems if a task is 
> just taking slightly too long to complete and report that fact, 
> perhaps because of other Hadoop or MR processes going on at the same 
> time.  If we re-run this map-reduce, it does sometimes run to a 
> successful completion without an exception.
>
> We are just testing map-reduce as a candidate for a number of data 
> reduction tasks right now.  We are running Hadoop 18.1 on a cluster of 
> 9 retired desktop machines that just have 100Mb networking and about 
> 2GB of RAM each, so that's why we are suspecting this could just be a 
> problem that tasks are taking slightly too long to report back they 
> have completed, rather than an actual bug.   (We will be upgrading 
> this test  cluster to Hadoop 19.x and 1Gb networking very shortly.)
>
> Thanks,
> RDH
>
> Begin forwarded message:
>
>> From: "Rick Cox" <ri...@gmail.com>
>> Date: May 14, 2008 9:01:31 AM PDT
>> To: core-user@hadoop.apache.org, apankov@iponweb.net
>> Subject: Re: Streaming and subprocess error code
>> Reply-To: core-user@hadoop.apache.org
>>
>> Does the syslog output from a should-have-failed task contain
>> something like this?
>>
>>    java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
>> subprocess failed with code 1
>>
>> (In particular, I'm curious if it mentions the RuntimeException.)
>>
>> Tasks that consume all their input and then exit non-zero are
>> definitely supposed to be counted as failed, so there's either a
>> problem with the setup or a bug somewhere.
>>
>> rick
>>
>> On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov <ap...@iponweb.net> 
>> wrote:
>>> Hi,
>>>
>>> I've tested this new option "-jobconf
>>> stream.non.zero.exit.status.is.failure=true". Seems working but 
>>> still not
>>> good for me. When mapper/reducer program have read all input data
>>> successfully and fails after that, streaming still finishes 
>>> successfully so
>>> there are no chances to know about some data post-processing errors in
>>> subprocesses :(
>>>
>>>
>>>
>>> Andrey Pankov wrote:
>>>
>>>>
>


Reduce not completing

Posted by Rick Hangartner <ha...@strands.com>.
Hi,

We seem to be seeing a runtime exception in the Reduce phase of a  
streaming Map-Reduce that has been mentioned before on this list.

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/%3C482BF75C.9030207@iponweb.net%3E

When I Google the exception, the only thing returned is this one  
short thread on the mailing list.  Unfortunately, we don't quite  
understand the exception message in our current situation, or the  
eventual explanation and resolution of that previous case.

We have tested that the Python script run in the Reduce phase works  
without problems.  It returns the correct results when run from the  
command line, fed on stdin with a file that is the output of the map  
phase of a small map-reduce job that fails in this way.

Here's the exception we are seeing from the jobtracker log:

2008-12-22 18:13:36,415 INFO org.apache.hadoop.mapred.JobInProgress:  
Task 'attempt_200812221742_0004_m_000009_0' has completed  
task_200812221742_0004_m_000009 successfully.
2008-12-22 18:13:50,607 INFO org.apache.hadoop.mapred.TaskInProgress:  
Error from attempt_200812221742_0004_r_000000_0:  
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess  
failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518)
    at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

2008-12-22 18:13:52,045 INFO org.apache.hadoop.mapred.JobTracker:  
Removed completed task 'attempt_200812221742_0004_r_000000_0' from  
'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:37777'
2008-12-22 18:13:52,175 INFO org.apache.hadoop.mapred.JobTracker:  
Adding task 'attempt_200812221742_0004_r_000000_1' to tip  
task_200812221742_0004_r_000000, for tracker  
'tracker_hnode5.cor.mystrands.in:localhost/127.0.0.1:55254'

We typically see 4 repetitions of this exception in the log. And we  
may see 1-2 sets of those repetitions.

If someone could explain what this exception actually means, and  
perhaps what we might need to change in our configuration to fix it,  
we would be most appreciative.  Naively, it almost seems as if a task is  
just taking slightly too long to complete and report that fact,  
perhaps because of other Hadoop or MR processes going on at the same  
time.  If we re-run this map-reduce, it does sometimes run to  
successful completion without an exception.

We are just testing map-reduce as a candidate for a number of data  
reduction tasks right now.  We are running Hadoop 18.1 on a cluster of  
9 retired desktop machines that have only 100Mb networking and about  
2GB of RAM each, which is why we suspect this could just be a  
problem of tasks taking slightly too long to report back that they  
have completed, rather than an actual bug.  (We will be upgrading  
this test cluster to Hadoop 19.x and 1Gb networking very shortly.)

Thanks,
RDH

Begin forwarded message:

> From: "Rick Cox" <ri...@gmail.com>
> Date: May 14, 2008 9:01:31 AM PDT
> To: core-user@hadoop.apache.org, apankov@iponweb.net
> Subject: Re: Streaming and subprocess error code
> Reply-To: core-user@hadoop.apache.org
>
> Does the syslog output from a should-have-failed task contain
> something like this?
>
>    java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
> subprocess failed with code 1
>
> (In particular, I'm curious if it mentions the RuntimeException.)
>
> Tasks that consume all their input and then exit non-zero are
> definitely supposed to be counted as failed, so there's either a
> problem with the setup or a bug somewhere.
>
> rick
>
> On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov <ap...@iponweb.net>  
> wrote:
>> Hi,
>>
>> I've tested this new option "-jobconf
>> stream.non.zero.exit.status.is.failure=true". Seems working but  
>> still not
>> good for me. When mapper/reducer program have read all input data
>> successfully and fails after that, streaming still finishes  
>> successfully so
>> there are no chances to know about some data post-processing errors  
>> in
>> subprocesses :(
>>
>>
>>
>> Andrey Pankov wrote:
>>
>>>

Re: Streaming and subprocess error code

Posted by Andrey Pankov <ap...@iponweb.net>.
Hi Rick,

I double-checked my test. The syslog output contains a message about the 
non-zero exit code (in this case the mapper finished with a segfault):

2008-05-14 18:12:04,473 INFO org.apache.hadoop.streaming.PipeMapRed: 
PipeMapRed.waitOutputThreads(): subprocess exited with code 134 in 
org.apache.hadoop.streaming.PipeMapRed

stderr contains a message with a dump or something about the segfault.

The reducer also finished with an error:

2008-05-14 20:28:34,128 INFO org.apache.hadoop.streaming.PipeMapRed: 
PipeMapRed.waitOutputThreads(): subprocess exited with code 55 in 
org.apache.hadoop.streaming.PipeMapRed

Nevertheless, the entire job is reported as successful:

08/05/14 18:12:03 INFO streaming.StreamJob:  map 0%  reduce 0%
08/05/14 18:12:05 INFO streaming.StreamJob:  map 100%  reduce 0%
08/05/14 18:12:06 INFO streaming.StreamJob:  map 100%  reduce 100%
08/05/14 18:12:06 INFO streaming.StreamJob: Job complete: 
job_200805131958_0020
08/05/14 18:12:06 INFO streaming.StreamJob: Output: 
/user/hadoop/data1_result




Rick Cox wrote:
> Hi,
> 
> Thanks: that message indicates the stream.non.zero.exit.is.failure
> feature isn't enabled for this task; the log is just reporting the
> exit status, but not raising the RuntimeException that it would if the
> feature were turned on.
> 
> I've had problems getting this parameter through from the command line
> before. If you've got access, you could try setting it in the
> hadoop-site.xml instead (I think it should be the tasktrackers that
> read that parameter).
> 
> (Sorry about the confusion here, we've been using that patch for so
> long I had forgotten it wasn't yet released, and I'm not exactly sure
> where we stand with these other bugs.)
> 
> rick
> 
> On Wed, May 14, 2008 at 11:05 PM, Andrey Pankov <ap...@iponweb.net> wrote:
>> Hi Rick,
>>
>>  Double checked my test. The syslog output contains msg about non-zero exit
>> code (in this case mapper finished with segfault)
>>
>>  2008-05-14 18:12:04,473 INFO org.apache.hadoop.streaming.PipeMapRed:
>> PipeMapRed.waitOutputThreads(): subprocess exited with code 134 in
>> org.apache.hadoop.streaming.PipeMapRed
>>
>>  stderr contains message with dump or smth about segfault.
>>
>>  Reducer job also finished with error:
>>
>>  2008-05-14 20:28:34,128 INFO org.apache.hadoop.streaming.PipeMapRed:
>> PipeMapRed.waitOutputThreads(): subprocess exited with code 55 in
>> org.apache.hadoop.streaming.PipeMapRed
>>
>>  Hence entire job is successful
>>
>>  08/05/14 18:12:03 INFO streaming.StreamJob:  map 0%  reduce 0%
>>  08/05/14 18:12:05 INFO streaming.StreamJob:  map 100%  reduce 0%
>>  08/05/14 18:12:06 INFO streaming.StreamJob:  map 100%  reduce 100%
>>  08/05/14 18:12:06 INFO streaming.StreamJob: Job complete:
>> job_200805131958_0020
>>  08/05/14 18:12:06 INFO streaming.StreamJob: Output:
>> /user/hadoop/data1_result
>>
>>
>>
>>
>>
>>
>>  Rick Cox wrote:
>>
>>> Does the syslog output from a should-have-failed task contain
>>> something like this?
>>>
>>>    java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
>>> subprocess failed with code 1
>>>
>>> (In particular, I'm curious if it mentions the RuntimeException.)
>>>
>>> Tasks that consume all their input and then exit non-zero are
>>> definitely supposed to be counted as failed, so there's either a
>>> problem with the setup or a bug somewhere.
>>>
>>> rick
>>>
>>> On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov <ap...@iponweb.net>
>> wrote:
>>>> Hi,
>>>>
>>>>  I've tested this new option "-jobconf
>>>> stream.non.zero.exit.status.is.failure=true". Seems working but still
>> not
>>>> good for me. When mapper/reducer program have read all input data
>>>> successfully and fails after that, streaming still finishes successfully
>> so
>>>> there are no chances to know about some data post-processing errors in
>>>> subprocesses :(
>>>>
>>>>
>>>>
>>>>  Andrey Pankov wrote:
>>>>
>>>>
>>>>> Hi Rick,
>>>>>
>>>>> Thank you for the quick response! I see this feature is in trunk and
>> not
>>>> available in last stable release. Anyway will try if it works for me
>> from
>>>> the trunk, and will try does it catch segmentation faults too.
>>>>
>>>>> Rick Cox wrote:
>>>>>
>>>>>
>>>>>> Try "-jobconf stream.non.zero.exit.status.is.failure=true".
>>>>>>
>>>>>> That will tell streaming that a non-zero exit is a task failure. To
>>>>>> turn that into an immediate whole job failure, I think configuring 0
>>>>>> task retries (mapred.map.max.attempts=1 and
>>>>>> mapred.reduce.max.attempts=1) will be sufficient.
>>>>>>
>>>>>> rick
>>>>>>
>>>>>> On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <ap...@iponweb.net>
>>>>>>
>>>> wrote:
>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>  I'm looking a way to force Streaming to shutdown the whole job in
>>>>>>>
>>>> case when
>>>>
>>>>>>> some of its subprocesses exits with non-zero error code.
>>>>>>>
>>>>>>>  We have next situation. Sometimes either mapper or reducer could
>>>>>>>
>>>> crush, as
>>>>
>>>>>>> a rule it returns some exit code. In this case entire streaming
>> job
>>>> finishes
>>>>
>>>>>>> successfully, but that's wrong. Almost the same when any
>> subprocess
>>>> finishes
>>>>
>>>>>>> with segmentation fault.
>>>>>>>
>>>>>>>  It's possible to check automatically if a subprocess crushed only
>> via
>>>> logs
>>>>
>>>>>>> but it means you need to parse tons of outputs/logs/dirs/etc.
>>>>>>>  In order to find logs of your job you have to know it's jobid ~
>>>>>>> job_200805130853_0016. I don't know easy way to determine it -
>> just
>>>> scan
>>>>
>>>>>>> stdout for the pattern. Then find logs of each mapper, each
>> reducer,
>>>> find a
>>>>
>>>>>>> way to parse them, etc, etc...
>>>>>>>
>>>>>>>  So, is there any easiest way get correct status of the whole
>>>>>>>
>>>> streaming job
>>>>
>>>>>>> or I still have to build rather fragile parsing systems for such
>>>>>>>
>>>> purposes?
>>>>
>>>>>>>  Thanks in advance.
>>>>>>>
>>>>>>>  --
>>>>>>>  Andrey Pankov
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>  --
>>>>  Andrey Pankov
>>>>
>>>>
>>>>
>>>
>>
>>  --
>>  Andrey Pankov
>>
>>
> 


-- 
Andrey Pankov


Re: Streaming and subprocess error code

Posted by Rick Cox <ri...@gmail.com>.
Does the syslog output from a should-have-failed task contain
something like this?

    java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
subprocess failed with code 1

(In particular, I'm curious if it mentions the RuntimeException.)

Tasks that consume all their input and then exit non-zero are
definitely supposed to be counted as failed, so there's either a
problem with the setup or a bug somewhere.

rick

On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov <ap...@iponweb.net> wrote:
> Hi,
>
>  I've tested this new option "-jobconf
> stream.non.zero.exit.status.is.failure=true". Seems working but still not
> good for me. When mapper/reducer program have read all input data
> successfully and fails after that, streaming still finishes successfully so
> there are no chances to know about some data post-processing errors in
> subprocesses :(
>
>
>
>  Andrey Pankov wrote:
>
> > Hi Rick,
> >
> > Thank you for the quick response! I see this feature is in trunk and not
> available in last stable release. Anyway will try if it works for me from
> the trunk, and will try does it catch segmentation faults too.
> >
> >
> > Rick Cox wrote:
> >
> > > Try "-jobconf stream.non.zero.exit.status.is.failure=true".
> > >
> > > That will tell streaming that a non-zero exit is a task failure. To
> > > turn that into an immediate whole job failure, I think configuring 0
> > > task retries (mapred.map.max.attempts=1 and
> > > mapred.reduce.max.attempts=1) will be sufficient.
> > >
> > > rick
> > >
> > > On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <ap...@iponweb.net>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > >  I'm looking a way to force Streaming to shutdown the whole job in
> case when
> > > > some of its subprocesses exits with non-zero error code.
> > > >
> > > >  We have next situation. Sometimes either mapper or reducer could
> crush, as
> > > > a rule it returns some exit code. In this case entire streaming job
> finishes
> > > > successfully, but that's wrong. Almost the same when any subprocess
> finishes
> > > > with segmentation fault.
> > > >
> > > >  It's possible to check automatically if a subprocess crushed only via
> logs
> > > > but it means you need to parse tons of outputs/logs/dirs/etc.
> > > >  In order to find logs of your job you have to know it's jobid ~
> > > > job_200805130853_0016. I don't know easy way to determine it - just
> scan
> > > > stdout for the pattern. Then find logs of each mapper, each reducer,
> find a
> > > > way to parse them, etc, etc...
> > > >
> > > >  So, is there any easiest way get correct status of the whole
> streaming job
> > > > or I still have to build rather fragile parsing systems for such
> purposes?
> > > >
> > > >  Thanks in advance.
> > > >
> > > >  --
> > > >  Andrey Pankov
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> >
>
>
>  --
>  Andrey Pankov
>
>

Re: Streaming and subprocess error code

Posted by Andrey Pankov <ap...@iponweb.net>.
Hi,

I've tested this new option "-jobconf 
stream.non.zero.exit.status.is.failure=true". It seems to work, but it's still 
not good enough for me: when the mapper/reducer program has read all the input data 
successfully and fails after that, streaming still finishes successfully, 
so there is no chance to learn about data post-processing errors 
in the subprocesses :(

Andrey Pankov wrote:
> Hi Rick,
> 
> Thank you for the quick response! I see this feature is in trunk and not 
> available in last stable release. Anyway will try if it works for me 
> from the trunk, and will try does it catch segmentation faults too.
> 
> 
> Rick Cox wrote:
>> Try "-jobconf stream.non.zero.exit.status.is.failure=true".
>>
>> That will tell streaming that a non-zero exit is a task failure. To
>> turn that into an immediate whole job failure, I think configuring 0
>> task retries (mapred.map.max.attempts=1 and
>> mapred.reduce.max.attempts=1) will be sufficient.
>>
>> rick
>>
>> On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <ap...@iponweb.net> 
>> wrote:
>>> Hi all,
>>>
>>>  I'm looking a way to force Streaming to shutdown the whole job in 
>>> case when
>>> some of its subprocesses exits with non-zero error code.
>>>
>>>  We have next situation. Sometimes either mapper or reducer could 
>>> crush, as
>>> a rule it returns some exit code. In this case entire streaming job 
>>> finishes
>>> successfully, but that's wrong. Almost the same when any subprocess 
>>> finishes
>>> with segmentation fault.
>>>
>>>  It's possible to check automatically if a subprocess crushed only 
>>> via logs
>>> but it means you need to parse tons of outputs/logs/dirs/etc.
>>>  In order to find logs of your job you have to know it's jobid ~
>>> job_200805130853_0016. I don't know easy way to determine it - just scan
>>> stdout for the pattern. Then find logs of each mapper, each reducer, 
>>> find a
>>> way to parse them, etc, etc...
>>>
>>>  So, is there any easiest way get correct status of the whole 
>>> streaming job
>>> or I still have to build rather fragile parsing systems for such 
>>> purposes?
>>>
>>>  Thanks in advance.
>>>
>>>  --
>>>  Andrey Pankov
>>>
>>>
>>
> 
> 


-- 
Andrey Pankov


Re: Streaming and subprocess error code

Posted by Andrey Pankov <ap...@iponweb.net>.
Hi Rick,

Thank you for the quick response! I see this feature is in trunk and not 
available in the last stable release. Anyway, I will try whether it works for me 
from trunk, and will also check whether it catches segmentation faults.


Rick Cox wrote:
> Try "-jobconf stream.non.zero.exit.status.is.failure=true".
> 
> That will tell streaming that a non-zero exit is a task failure. To
> turn that into an immediate whole job failure, I think configuring 0
> task retries (mapred.map.max.attempts=1 and
> mapred.reduce.max.attempts=1) will be sufficient.
> 
> rick
> 
> On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <ap...@iponweb.net> wrote:
>> Hi all,
>>
>>  I'm looking a way to force Streaming to shutdown the whole job in case when
>> some of its subprocesses exits with non-zero error code.
>>
>>  We have next situation. Sometimes either mapper or reducer could crush, as
>> a rule it returns some exit code. In this case entire streaming job finishes
>> successfully, but that's wrong. Almost the same when any subprocess finishes
>> with segmentation fault.
>>
>>  It's possible to check automatically if a subprocess crushed only via logs
>> but it means you need to parse tons of outputs/logs/dirs/etc.
>>  In order to find logs of your job you have to know it's jobid ~
>> job_200805130853_0016. I don't know easy way to determine it - just scan
>> stdout for the pattern. Then find logs of each mapper, each reducer, find a
>> way to parse them, etc, etc...
>>
>>  So, is there any easiest way get correct status of the whole streaming job
>> or I still have to build rather fragile parsing systems for such purposes?
>>
>>  Thanks in advance.
>>
>>  --
>>  Andrey Pankov
>>
>>
> 


-- 
Andrey Pankov


Re: Streaming and subprocess error code

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On May 13, 2008, at 8:09 AM, Rick Cox wrote:

> Try "-jobconf stream.non.zero.exit.status.is.failure=true".
>

Anyone willing to document this on http://hadoop.apache.org/core/docs/current/streaming.html?
Ideally HADOOP-2057 should have documented this useful feature,  
better late than never... I've opened
http://issues.apache.org/jira/browse/HADOOP-3379 for this  
documentation request.

Arun

> That will tell streaming that a non-zero exit is a task failure. To
> turn that into an immediate whole job failure, I think configuring 0
> task retries (mapred.map.max.attempts=1 and
> mapred.reduce.max.attempts=1) will be sufficient.
>
> rick
>
> On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov  
> <ap...@iponweb.net> wrote:
>> Hi all,
>>
>>  I'm looking a way to force Streaming to shutdown the whole job in  
>> case when
>> some of its subprocesses exits with non-zero error code.
>>
>>  We have next situation. Sometimes either mapper or reducer could  
>> crush, as
>> a rule it returns some exit code. In this case entire streaming  
>> job finishes
>> successfully, but that's wrong. Almost the same when any  
>> subprocess finishes
>> with segmentation fault.
>>
>>  It's possible to check automatically if a subprocess crushed only  
>> via logs
>> but it means you need to parse tons of outputs/logs/dirs/etc.
>>  In order to find logs of your job you have to know it's jobid ~
>> job_200805130853_0016. I don't know easy way to determine it -  
>> just scan
>> stdout for the pattern. Then find logs of each mapper, each  
>> reducer, find a
>> way to parse them, etc, etc...
>>
>>  So, is there any easiest way get correct status of the whole  
>> streaming job
>> or I still have to build rather fragile parsing systems for such  
>> purposes?
>>
>>  Thanks in advance.
>>
>>  --
>>  Andrey Pankov
>>
>>


Re: Streaming and subprocess error code

Posted by Rick Cox <ri...@gmail.com>.
Try "-jobconf stream.non.zero.exit.status.is.failure=true".

That will tell streaming that a non-zero exit is a task failure. To
turn that into an immediate whole job failure, I think configuring 0
task retries (mapred.map.max.attempts=1 and
mapred.reduce.max.attempts=1) will be sufficient.
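
As a hypothetical illustration (the script name and record layout are made up, 
not from this thread), a mapper written so that it exits non-zero on bad input 
would then fail its task attempt, and with the single-attempt settings above 
the whole job fails immediately:

#!/usr/bin/env python
# failing_mapper.py -- hypothetical sketch: abort the task attempt on bad input
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        # With stream.non.zero.exit.status.is.failure=true, this non-zero exit
        # marks the task attempt as failed rather than being silently ignored.
        sys.stderr.write("malformed record, aborting: %r\n" % line)
        sys.exit(1)
    sys.stdout.write(fields[0] + "\t" + fields[1] + "\n")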

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <ap...@iponweb.net> wrote:
> Hi all,
>
>  I'm looking a way to force Streaming to shutdown the whole job in case when
> some of its subprocesses exits with non-zero error code.
>
>  We have next situation. Sometimes either mapper or reducer could crush, as
> a rule it returns some exit code. In this case entire streaming job finishes
> successfully, but that's wrong. Almost the same when any subprocess finishes
> with segmentation fault.
>
>  It's possible to check automatically if a subprocess crushed only via logs
> but it means you need to parse tons of outputs/logs/dirs/etc.
>  In order to find logs of your job you have to know it's jobid ~
> job_200805130853_0016. I don't know easy way to determine it - just scan
> stdout for the pattern. Then find logs of each mapper, each reducer, find a
> way to parse them, etc, etc...
>
>  So, is there any easiest way get correct status of the whole streaming job
> or I still have to build rather fragile parsing systems for such purposes?
>
>  Thanks in advance.
>
>  --
>  Andrey Pankov
>
>