Posted to mapreduce-dev@hadoop.apache.org by Pillis W <pi...@gmail.com> on 2017/04/13 20:33:48 UTC

Skip bad records when streaming supported?

Hello,
I am using 'hadoop-streaming.jar' to do a simple word count, and want to
skip records that fail processing. Below is the actual command I run;
the mapper always fails on one record, which in turn fails the job. The
input file is 3 lines, 1 of which is bad.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D mapred.job.name=SkipTest
-Dmapreduce.task.skip.start.attempts=1 -Dmapreduce.map.skip.maxrecords=1
-Dmapreduce.reduce.skip.maxgroups=1
-Dmapreduce.map.skip.proc.count.autoincr=false
-Dmapreduce.reduce.skip.proc.count.autoincr=false -D mapred.reduce.tasks=1
-D mapred.map.tasks=1 -files
/home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
-input /user/hadoop/data/test1 -output /user/hadoop/data/output-test-5
-mapper "python wordcount-mapper-t1.py" -reducer "python
wordcount-reducer-t1.py"


Is skipping of bad records supported when MapReduce is used in
streaming mode?

Thanks in advance.
PW

Re: Skip bad records when streaming supported?

Posted by Daniel Templeton <da...@cloudera.com>.
To quote the docs:

---
This feature can be used when map/reduce tasks crashes deterministically 
on certain input. This happens due to bugs in the map/reduce function. 
The usual course would be to fix these bugs. But sometimes this is not 
possible; perhaps the bug is in third party libraries for which the 
source code is not available. Due to this, the task never reaches to 
completion even with multiple attempts and complete data for that task 
is lost.

With this feature, only a small portion of data is lost surrounding the 
bad record, which may be acceptable for some user applications. see 
setMapperMaxSkipRecords(Configuration, long)
---

Basically, it's a heavy-handed approach that you should only use as a 
last resort.
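
If you do want to experiment with skip mode from streaming, note that
the framework locates bad records by watching a processed-record
counter, and your command sets the autoincr properties to false, so the
streaming script has to increment that counter itself. Streaming
processes can update counters by writing "reporter" lines to stderr; a
minimal helper might look like the sketch below (the counter group and
name are copied from SkipBadRecords as I remember them, so verify them
against your Hadoop version):

```python
import sys

def report_processed(amount=1, group="SkippingTaskCounters",
                     counter="MapProcessedRecords"):
    """Emit a Hadoop Streaming counter update on stderr.

    Streaming treats stderr lines of the form
    'reporter:counter:<group>,<counter>,<amount>' as counter
    increments; skip mode watches the processed-record counter
    to narrow down which record is bad.
    """
    sys.stderr.write("reporter:counter:%s,%s,%d\n"
                     % (group, counter, amount))
```

Calling report_processed() after each successfully handled input line
is what lets the framework pinpoint the offending record across retry
attempts.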

Daniel


On 4/13/17 3:24 PM, Pillis W wrote:
> Thanks Daniel.
>
> Please correct me if I have misunderstood, but according to the 
> documentation at 
> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records 
> it seems the sole purpose of this feature is to tolerate unknown 
> failures/exceptions in mappers/reducers. If I were able to catch all 
> failures myself, I would not need this feature at all - is that not 
> true?
>
> If I have understood it incorrectly, when would one use the feature to 
> skip bad records?
>
> Regards,
> PW


---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-help@hadoop.apache.org


Re: Skip bad records when streaming supported?

Posted by Pillis W <pi...@gmail.com>.
Thanks Daniel.

Please correct me if I have misunderstood, but according to the
documentation at
http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records
it seems the sole purpose of this feature is to tolerate unknown
failures/exceptions in mappers/reducers. If I were able to catch all
failures myself, I would not need this feature at all - is that not true?

If I have understood it incorrectly, when would one use the feature to skip
bad records?

Regards,
PW




On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton <da...@cloudera.com>
wrote:

> You have to modify wordcount-mapper-t1.py to just ignore the bad line.  In
> the worst case, you should be able to do something like:
>
> for line in sys.stdin:
>     try:
>         # Insert processing code here
>         pass
>     except Exception:
>         # Error processing record; ignore it
>         pass
>
> Daniel

Re: Skip bad records when streaming supported?

Posted by Daniel Templeton <da...@cloudera.com>.
You have to modify wordcount-mapper-t1.py to just ignore the bad line.  
In the worst case, you should be able to do something like:

for line in sys.stdin:
   try:
     # Insert processing code here
   except:
     # Error processing record, ignore it
     pass
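
Fleshed out into a complete streaming word-count mapper (the file name
and tokenization are only illustrative, not your actual script), that
pattern might look like:

```python
import sys

def map_line(line):
    """Tokenize one input line into (word, 1) pairs.

    Any exception raised here is treated as a bad record by the
    caller, which skips the line instead of crashing the task.
    """
    return [(word, 1) for word in line.split()]

def main():
    for line in sys.stdin:
        try:
            for word, count in map_line(line):
                # Streaming expects tab-separated key/value output.
                sys.stdout.write("%s\t%d\n" % (word, count))
        except Exception:
            # Bad record: note it on stderr and keep going rather
            # than letting the exception fail the whole task.
            sys.stderr.write("skipping bad record: %r\n" % (line,))

if __name__ == "__main__":
    main()
```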

Daniel


