Posted to mapreduce-user@hadoop.apache.org by Justin Woody <ju...@gmail.com> on 2011/10/12 18:36:45 UTC

Skipping Bad Records

Can anyone confirm whether the skip options work for MR jobs using the
new API? I have a job using the new API and I cannot get the job to
skip corrupted records. I tried configuring job properties manually
and using the SkipBadRecords class.

Thanks,
Justin

Re: Skipping Bad Records

Posted by Justin Woody <ju...@gmail.com>.

Tom,

Agreed. This is a third-party reader operating on a custom data
format, and I control neither. The error is happening in the reader,
and I'm trying to isolate the issue so that I can handle it properly.

Thanks!
Justin
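
One way to isolate failures that originate in a reader you do not
control is to wrap it and catch per-record errors at that boundary. The
sketch below is plain Java with no Hadoop dependency, and every name in
it (RecordSource, TolerantSource, fakeSource) is hypothetical. It also
assumes the underlying reader can keep going after a failed call, which
may not hold for a given third-party reader, so it gives up after a
configurable number of consecutive failures.

```java
import java.util.Arrays;
import java.util.Iterator;

// Plain-Java sketch; all names are hypothetical and no Hadoop types are used.
class TolerantReaderDemo {
    /** Stand-in for a third-party reader; returns null at end of input. */
    interface RecordSource {
        String next();
    }

    /** Wraps a source and skips records whose read throws, up to a limit. */
    static final class TolerantSource implements RecordSource {
        private final RecordSource delegate;
        private final int maxConsecutiveFailures;
        private int skipped = 0;

        TolerantSource(RecordSource delegate, int maxConsecutiveFailures) {
            this.delegate = delegate;
            this.maxConsecutiveFailures = maxConsecutiveFailures;
        }

        @Override
        public String next() {
            int failures = 0;
            while (true) {
                try {
                    return delegate.next();
                } catch (RuntimeException e) {
                    failures++;
                    // Assumption: a reader that keeps failing is not advancing,
                    // so stop rather than loop forever.
                    if (failures > maxConsecutiveFailures) {
                        throw new IllegalStateException(
                            "reader did not recover after " + failures
                                + " consecutive failures", e);
                    }
                    skipped++;
                }
            }
        }

        int skipped() {
            return skipped;
        }
    }

    /** Fake reader for the demo: throws whenever it reaches the item "BAD". */
    static RecordSource fakeSource(String... items) {
        Iterator<String> it = Arrays.asList(items).iterator();
        return () -> {
            if (!it.hasNext()) {
                return null;
            }
            String item = it.next();
            if ("BAD".equals(item)) {
                throw new IllegalArgumentException("corrupt record");
            }
            return item;
        };
    }

    public static void main(String[] args) {
        TolerantSource source = new TolerantSource(fakeSource("a", "BAD", "b"), 3);
        for (String r = source.next(); r != null; r = source.next()) {
            System.out.println(r);
        }
        System.out.println("skipped=" + source.skipped());
    }
}
```

If the real reader cannot resume after an exception, this wrapper will
only report the failure rather than recover from it, which is still
useful for pinning down which input triggers the error.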

On Thu, Oct 13, 2011 at 5:31 PM, Tom White <to...@cloudera.com> wrote:
> Justin,
>
> The skipping feature should really only be used when you are calling
> out to a third-party library that may segfault on corrupt data, and
> even then it's probably better to use a subprocess to handle it, as
> Owen suggested here:
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cCAFQoU9Ekv+SBvAv-bSF5dORJO68VSj6zTqXywWUT+qHS3V3bbA@mail.gmail.com%3e.
>
> In other cases you should handle the corrupt data in your mapper or
> reducer, by catching the relevant exception, for example.
>
> Tom
>
> On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <ju...@gmail.com> wrote:
>> Harsh,
>>
>> Thanks for the info. If I get some time maybe I can assist. I'm
>> looking over your code now. For now I am failing the files with the
>> mapred.max.map.failures.percent property, but I'm losing a lot of good
>> data going that route.
>>
>> Justin
>>
>> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <ha...@cloudera.com> wrote:
>>> Justin,
>>>
>>> Unfortunately not. The new API does not have a skipping feature yet
>>> like the older one.
>>>
>>> I did get started on some work on
>>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
>>> haven't been able to find time to complete it with proper tests and
>>> such. I'll try to do it within a week from now.
>>>
>>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <ju...@gmail.com> wrote:
>>>> Can anyone confirm whether the skip options work for MR jobs using the
>>>> new API? I have a job using the new API and I cannot get the job to
>>>> skip corrupted records. I tried configuring job properties manually
>>>> and using the SkipBadRecords class.
>>>>
>>>> Thanks,
>>>> Justin
>>>>
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>

Re: Skipping Bad Records

Posted by Tom White <to...@cloudera.com>.

Justin,

The skipping feature should really only be used when you are calling
out to a third-party library that may segfault on corrupt data, and
even then it's probably better to use a subprocess to handle it, as
Owen suggested here:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cCAFQoU9Ekv+SBvAv-bSF5dORJO68VSj6zTqXywWUT+qHS3V3bbA@mail.gmail.com%3e.

In other cases you should handle the corrupt data in your mapper or
reducer, by catching the relevant exception, for example.

Tom
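
As a rough illustration of catching the relevant exception per record,
here is a minimal sketch in plain Java with no Hadoop dependency. The
class and method names (BadRecordDemo, process, Result) are
hypothetical, and integer parsing stands in for whatever decoding the
real job does. In an actual Mapper the same try/catch would sit inside
map(), and the skipped count would more likely be a job counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch; all names are hypothetical and no Hadoop types are used.
class BadRecordDemo {
    /** Outcome of one pass over the input. */
    static final class Result {
        final List<Integer> parsed = new ArrayList<>();
        int skipped = 0;
    }

    // Stand-in for the body of a map() method: parse each record and, on a
    // parse failure, count it and move on instead of failing the whole task.
    static Result process(List<String> records) {
        Result result = new Result();
        for (String record : records) {
            try {
                result.parsed.add(Integer.parseInt(record.trim()));
            } catch (NumberFormatException e) {
                // Assumption: a real job would increment a counter here so
                // the number of bad records is visible in the job report.
                result.skipped++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = process(Arrays.asList("10", "oops", " 32 ", ""));
        System.out.println("parsed=" + r.parsed + " skipped=" + r.skipped);
    }
}
```

Catching only the specific exception the parser is known to throw keeps
genuine bugs from being silently counted as bad data.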

On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <ju...@gmail.com> wrote:
> Harsh,
>
> Thanks for the info. If I get some time maybe I can assist. I'm
> looking over your code now. For now I am failing the files with the
> mapred.max.map.failures.percent property, but I'm losing a lot of good
> data going that route.
>
> Justin
>
> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <ha...@cloudera.com> wrote:
>> Justin,
>>
>> Unfortunately not. The new API does not have a skipping feature yet
>> like the older one.
>>
>> I did get started on some work on
>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
>> haven't been able to find time to complete it with proper tests and
>> such. I'll try to do it within a week from now.
>>
>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <ju...@gmail.com> wrote:
>>> Can anyone confirm whether the skip options work for MR jobs using the
>>> new API? I have a job using the new API and I cannot get the job to
>>> skip corrupted records. I tried configuring job properties manually
>>> and using the SkipBadRecords class.
>>>
>>> Thanks,
>>> Justin
>>>
>>
>>
>>
>> --
>> Harsh J
>>
>

Re: Skipping Bad Records

Posted by Justin Woody <ju...@gmail.com>.

Harsh,

Thanks for the info. If I get some time maybe I can assist. I'm
looking over your code now. For now I am failing the files with the
mapred.max.map.failures.percent property, but I'm losing a lot of good
data going that route.

Justin

On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <ha...@cloudera.com> wrote:
> Justin,
>
> Unfortunately not. The new API does not have a skipping feature yet
> like the older one.
>
> I did get started on some work on
> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
> haven't been able to find time to complete it with proper tests and
> such. I'll try to do it within a week from now.
>
> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <ju...@gmail.com> wrote:
>> Can anyone confirm whether the skip options work for MR jobs using the
>> new API? I have a job using the new API and I cannot get the job to
>> skip corrupted records. I tried configuring job properties manually
>> and using the SkipBadRecords class.
>>
>> Thanks,
>> Justin
>>
>
>
>
> --
> Harsh J
>

Re: Skipping Bad Records

Posted by Harsh J <ha...@cloudera.com>.

Justin,

Unfortunately not. The new API does not have a skipping feature yet
like the older one.

I did get started on some work on
https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
haven't been able to find time to complete it with proper tests and
such. I'll try to do it within a week from now.

On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <ju...@gmail.com> wrote:
> Can anyone confirm whether the skip options work for MR jobs using the
> new API? I have a job using the new API and I cannot get the job to
> skip corrupted records. I tried configuring job properties manually
> and using the SkipBadRecords class.
>
> Thanks,
> Justin
>

-- 
Harsh J