Posted to mapreduce-user@hadoop.apache.org by Greg Roelofs <ro...@yahoo-inc.com> on 2010/06/15 21:55:18 UTC

concatenated gzip support: default on or not?

As some folks have found out the hard way, only the first member of a
concatenated gzip file is recognized by current versions of Hadoop,
including trunk; the remainder is silently ignored.  I'm working on
the fix (MAPREDUCE-469), and the question has come up whether to make
the fixed version the default, which would represent a behavior change.
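
For anyone who hasn't run into this, here is a rough illustration of the symptom in Python rather than in Hadoop's own Java code. It is only a sketch: the function names are made up for this example, and it relies on the standard zlib module stopping at the end of the first gzip member and leaving the remaining bytes in unused_data.

```python
import gzip
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # tell zlib to expect a gzip header/trailer


def first_member_only(data: bytes) -> bytes:
    """Decode just the first gzip member; later members are silently dropped."""
    d = zlib.decompressobj(GZIP_WBITS)
    return d.decompress(data) + d.flush()


def all_members(data: bytes) -> bytes:
    """Decode every member by restarting on whatever the last member left over."""
    out = []
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        out.append(d.decompress(data) + d.flush())
        data = d.unused_data  # bytes that follow the member just decoded
    return b"".join(out)


# Two independently compressed members, concatenated (like `cat a.gz b.gz`).
blob = gzip.compress(b"first\n") + gzip.compress(b"second\n")

print(first_member_only(blob))  # b'first\n'
print(all_members(blob))        # b'first\nsecond\n'
```

Note that the first function raises no error and gives no hint that data was skipped, which is what makes the failure silent.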

So, three options:

(1) configurable; concatenation support not enabled by default
(2) configurable; concatenation support enabled by default (behavior change)
(3) not configurable; concatenation support always enabled (behavior change)
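
The three options differ only in whether a switch exists and what it defaults to. A minimal sketch in Python, assuming a dict-like configuration; the property name and helper names below are hypothetical and are not taken from the patch:

```python
import zlib

# Hypothetical property name, for illustration only; not from MAPREDUCE-469.
CONCAT_KEY = "gzip.concat.enabled"


def decode(data: bytes, read_all_members: bool) -> bytes:
    """Decode one gzip member, or all of them if read_all_members is set."""
    out = []
    while data:
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip framing
        out.append(d.decompress(data) + d.flush())
        if not read_all_members:
            break  # old behavior: ignore everything after the first member
        data = d.unused_data
    return b"".join(out)


def decode_with_option(data: bytes, conf: dict, option: int) -> bytes:
    """Apply option (1), (2) or (3) from the list above."""
    if option == 1:
        enabled = conf.get(CONCAT_KEY, False)  # configurable, default off
    elif option == 2:
        enabled = conf.get(CONCAT_KEY, True)   # configurable, default on
    elif option == 3:
        enabled = True                         # not configurable, always on
    else:
        raise ValueError("option must be 1, 2 or 3")
    return decode(data, enabled)
```

With an empty configuration, option (1) returns only the first member while (2) and (3) return everything; under (3) the setting is ignored even if a job sets it to false.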

Opinions?  The current proto-patch makes it configurable but leaves the
default unchanged from previous behavior (option 1).  However, since the
failure is silent (and there doesn't appear to be an easy way to emit a
warning due to buffering effects: MAPREDUCE-1795), a number of users have
argued that this is purely a bug that needs to be fixed, in which case
perhaps (3) would be appropriate.  I'm personally sympathetic to this
view, FWIW; on the other hand, unanticipated, user-visible behavior
changes can lead to unhappiness, too.

Note that concatenated bzip2 streams are not supported in 0.20 but are
in trunk (reportedly--I haven't yet verified for myself), thanks to the
splittable-codec support.  AFAIK, this is not configurable--i.e., it's
similar to option (3) except with the benefit of extra functionality
included on top.

Thanks,
  Greg

Re: concatenated gzip support: default on or not?

Posted by Hong Tang <ht...@yahoo-inc.com>.
+1 for (3).

On Jun 15, 2010, at 12:55 PM, Greg Roelofs wrote:

> As some folks have found out the hard way, only the first member of a
> concatenated gzip file is recognized by current versions of Hadoop,
> including trunk; the remainder is silently ignored.  I'm working on
> the fix (MAPREDUCE-469), and the question has come up whether to make
> the fixed version the default, which would represent a behavior change.
>
> So, three options:
>
> (1) configurable; concatenation support not enabled by default
> (2) configurable; concatenation support enabled by default (behavior change)
> (3) not configurable; concatenation support always enabled (behavior change)
>
> Opinions?  The current proto-patch makes it configurable but leaves the
> default unchanged from previous behavior (option 1).  However, since the
> failure is silent (and there doesn't appear to be an easy way to emit a
> warning due to buffering effects: MAPREDUCE-1795), a number of users have
> argued that this is purely a bug that needs to be fixed, in which case
> perhaps (3) would be appropriate.  I'm personally sympathetic to this
> view, FWIW; on the other hand, unanticipated, user-visible behavior
> changes can lead to unhappiness, too.
>
> Note that concatenated bzip2 streams are not supported in 0.20 but are
> in trunk (reportedly--I haven't yet verified for myself), thanks to the
> splittable-codec support.  AFAIK, this is not configurable--i.e., it's
> similar to option (3) except with the benefit of extra functionality
> included on top.
>
> Thanks,
>  Greg


Re: concatenated gzip support: default on or not?

Posted by Greg Roelofs <ro...@yahoo-inc.com>.
> As some folks have found out the hard way, only the first member of a
> concatenated gzip file is recognized by current versions of Hadoop,
> including trunk; the remainder is silently ignored.  I'm working on
> the fix (MAPREDUCE-469), and the question has come up whether to make
> the fixed version the default, which would represent a behavior change.

> So, three options:

> (1) configurable; concatenation support not enabled by default
> (2) configurable; concatenation support enabled by default (behavior change)
> (3) not configurable; concatenation support always enabled (behavior change)

Not a vast amount of feedback, but the consensus is clearly for enabling
concatenation support by default, and there doesn't even seem to be any
real interest in making it configurable.

So the next version of the patch (incorporating informal review feedback
and dealing with most of my own FIXMEs) will go with (2) just because it's
trivial to do so, but if I don't hear any arguments against by, say, early
next week, the subsequent (final?) version will hardcode it (option 3) to
match the bzip2 behavior.

Thanks,
  Greg