Posted to mapreduce-dev@hadoop.apache.org by Mostafa Elhemali <mo...@gmail.com> on 2012/12/03 21:26:35 UTC

Fate of skipping-bad-records feature (MAPREDUCE-4840)

(Taking the discussion out of the JIRA so I can tap into the historical and
other knowledge of the group)

Hi all,
I was looking through MR code in trunk (always a fun weekend activity) and
was puzzled by the loose ends around the feature to skip bad records; by
loose ends I mean there's a lot of code in there but it can't really work.
I dug through JIRAs and found MAPREDUCE-1932, which I thought implied that
the feature is now intentionally dead, so I filed MAPREDUCE-4840 to delete
dead code and deprecate APIs. In that new JIRA, however, Harsh corrected me,
pointing out that MAPREDUCE-1932 only applied to the new API. So my
questions are:

1. Do we want to support the skip-bad-records feature for the old API in
trunk? Personally I think it's a bit weird to tie this feature to which API
you use since the feature is configured by config file, not as part of the
API, but I don't have a strong opinion either way.
2. Is there a JIRA/other work tracking enabling this feature in YARN/trunk?
There are "Not yet implemented" exceptions being thrown in the code that
make me think someone is aware of that and there's a plan to fix it, so I'm
wondering where that is tracked.


Thanks,
Mostafa

Re: Fate of skipping-bad-records feature (MAPREDUCE-4840)

Posted by Harsh J <ha...@cloudera.com>.
Hi,

On Tue, Dec 4, 2012 at 1:56 AM, Mostafa Elhemali
<mo...@gmail.com> wrote:
> (Taking the discussion out of the JIRA so I can tap into the historical and
> other knowledge of the group)
>
> Hi all,
> I was looking through MR code in trunk (always a fun weekend activity) and
> was puzzled by the loose ends around the feature to skip bad records; by
> loose ends I mean there's a lot of code in there but it can't really work.
> I dug through JIRAs and found MAPREDUCE-1932, which I thought implied that
> the feature is now intentionally dead, so I filed MAPREDUCE-4840 to delete
> dead code and deprecate APIs. In that new JIRA, however, Harsh corrected me,
> pointing out that MAPREDUCE-1932 only applied to the new API. So my
> questions are:
>
> 1. Do we want to support the skip-bad-records feature for the old API in
> trunk? Personally I think it's a bit weird to tie this feature to which API
> you use since the feature is configured by config file, not as part of the
> API, but I don't have a strong opinion either way.

I personally think we should not support it anymore.

There are some hard bindings to this feature set right into the MR runtime
classes, which is why it's not been trivial to get it done in the new
API as well.

> 2. Is there a JIRA/other work tracking enabling this feature in YARN/trunk?
> There are "Not yet implemented" exceptions being thrown in the code that
> make me think someone is aware of that and there's a plan to fix it, so I'm
> wondering where that is tracked.

I'm not aware of anyone working on this. I did attempt it myself once,
in the pre-MR2 days, but we wound up deciding it was unsuitable for us
to support this ourselves.

--
Harsh J