Posted to user@hadoop.apache.org by 小强 <79...@qq.com> on 2013/06/15 08:39:15 UTC

Question about Skip Bad Records

Hi, I found that SkippingRecordReader is no longer supported in the new API, and I am curious about the reason. Can anyone tell me why?


Besides, when I looked into the old API to figure out what skip mode does, I was a little confused by the logic there.
As I understand it, if the Java API is used, we can always locate precisely which record is the bad one.
If streaming is used, then as long as the user handles the counter correctly (I mean incrementing the counter for each record processed), we can also locate the exact bad record. (I wonder if I am missing something here.)
But if the user does not maintain the counter, locating bad records is always very costly for the framework, even with binary search.
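
To make the streaming case concrete, here is a minimal sketch of what I mean by handling the counter. It assumes the old-API counter group "SkippingTaskCounters" and counter "MapProcessedRecords" from SkipBadRecords, plus the usual streaming convention of writing reporter:counter lines to stderr. The process() function and the tab-separated record format are hypothetical stand-ins for real user logic.

```python
import sys

# Counter group and name that the old-API skip mode reads (from SkipBadRecords).
COUNTER_GROUP = "SkippingTaskCounters"
COUNTER_NAME = "MapProcessedRecords"


def process(line):
    """Hypothetical user logic: upper-case the value of a tab-separated record."""
    key, value = line.rstrip("\n").split("\t", 1)
    return f"{key}\t{value.upper()}"


def run_mapper(lines, out=sys.stdout, err=sys.stderr):
    """Process records one at a time, reporting each one once it is done."""
    for line in lines:
        out.write(process(line) + "\n")
        # Increment the counter only after process() has returned, so a crash
        # inside process() leaves the count pointing at the failing record.
        err.write(f"reporter:counter:{COUNTER_GROUP},{COUNTER_NAME},1\n")
        err.flush()

# In a real streaming script, the entry point would be: run_mapper(sys.stdin)
```

If the script buffers input and processes records in batches, the counter would have to be incremented per batch instead, and then only the batch containing the bad record is known, not the exact record.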


To sum up:
Ques 1:  Why was skip mode removed in the new API?
Ques 2:  If the user handles the counter correctly in streaming, can we locate the exact bad record?
Ques 3:  In skip mode, why not locate more bad records by restarting the user logic, instead of locating one bad record per task attempt?
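
To illustrate the cost behind Ques 3, here is a toy model of narrowing a failing range by binary search when every probe costs a full task attempt. This is not Hadoop's actual implementation. The function names are hypothetical, and the model assumes exactly one bad record in the range.

```python
def attempt_fails(records, is_bad, skipped):
    """Simulate one task attempt: it fails if any non-skipped record is bad."""
    return any(is_bad(r) for i, r in enumerate(records) if i not in skipped)


def locate_bad_record(records, is_bad, lo, hi):
    """Narrow the failing range [lo, hi) down to a single record index.

    Assumes exactly one bad record lies in [lo, hi).
    Returns (index_of_bad_record, number_of_attempts_used).
    """
    attempts = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        attempts += 1
        # Skip the upper half [mid, hi) and re-run the task on the rest.
        if attempt_fails(records, is_bad, set(range(mid, hi))):
            hi = mid  # still failing, so the bad record is in [lo, mid)
        else:
            lo = mid  # attempt succeeded, so the bad record was skipped
    return lo, attempts


# One bad record (value 11) among 16 records.
index, attempts = locate_bad_record(list(range(16)), lambda r: r == 11, 0, 16)
print(index, attempts)
```

In this model a range of n records needs about log2(n) attempts to isolate one bad record, which is why I ask whether restarting only the user logic inside a single attempt would be cheaper.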


Thank you!


Dasheng Jiang

Re: Question about Skip Bad Records

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Please see comments in https://issues.apache.org/jira/browse/MAPREDUCE-1932

On Sat, Jun 15, 2013 at 12:09 PM, 小强 <79...@qq.com> wrote:
> Hi, I found the SkippingRecordReader is no longer supported in the new api
> and I am curious about the reason, can anyone tell me.
>
> Besides, when I look into the old api and try to figure out what skip mode
> was doing, I am a little confused about the logic there.
> In my comprehension, if java api is used we can always precisely locate
> which one is the bad record.
> If streaming is used, as long as user can handle the counter correctly (I
> mean accumulate the counter for each record in), we can also locate the
> exact bad record. (I wonder if I miss something here)
> But if user don't care about the counter it's always a disaster for the
> framework to locate bad records (even using binary search)
>
> To sum up:
> Ques 1:  why skip mode is removed in the new api
> Ques 2:  if user handle counter correctly in streaming, can we locate the
> exact bad record
> Ques 3:  when in skip mode, why not locate more bad records by restart the
> user logic instead of locate one bad record for each task attempt
>
> Thank you!
>
> Dasheng Jiang



-- 
Harsh J
