Posted to common-user@hadoop.apache.org by Oscar Gothberg <Os...@platform-a.com> on 2009/01/10 01:26:36 UTC

Problem with Hadoop and concatenated gzip files

Hi,

I'm having trouble with Hadoop (tested with 0.17 and 0.19) not fully processing certain gzipped input files. It reads and processes only the first part of the gzipped file and ignores the rest without any kind of warning.

It affects at least (but maybe isn't limited to?) gzip files that are the result of concatenation, which the gzip format allows:
http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

Repro case, using the "WordCount" example from the Hadoop distribution:
$ echo 'one two three' > f1
$ echo 'four five six' > f2
$ gzip -c f1 > combined_file.gz
$ gzip -c f2 >> combined_file.gz

Now, if I run "WordCount" with combined_file.gz as input, it will only find the words 'one', 'two', 'three', but not 'four', 'five', 'six'.
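
As a control outside Hadoop, the same two-member file can be run through gzip itself, which should return all six words. This is only a minimal sketch: the scratch directory and the `word_count` variable are illustrative names, not part of the original repro.

```shell
# Rebuild the two-member file from the repro in a scratch directory.
workdir=$(mktemp -d)
echo 'one two three' > "$workdir/f1"
echo 'four five six' > "$workdir/f2"
gzip -c "$workdir/f1" > "$workdir/combined_file.gz"
gzip -c "$workdir/f2" >> "$workdir/combined_file.gz"

# gzip decompresses every member of a concatenated file in sequence,
# so the word count here covers both f1 and f2.
word_count=$(gzip -dc "$workdir/combined_file.gz" | wc -w | tr -d ' ')
echo "words seen by gzip: $word_count"
```

If this reports 6 while WordCount sees only three words, the data is intact and the loss happens on the reading side.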

It seems Java's GZIPInputStream may have a similar issue:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425

Now, if I unzip and re-gzip this 'combined_file.gz' manually, the problem goes away.
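
The manual unzip and re-gzip step can be done as a streaming pipeline, so the uncompressed data never has to be written to disk. The helper name `regzip_single_member` and the file names are hypothetical, and this is a sketch of the workaround described above rather than a fix for the underlying problem.

```shell
# Hypothetical helper: rewrite a (possibly multi-member) gzip file as a
# single-member gzip file by decompressing and recompressing in one pipe.
regzip_single_member() {
    src=$1
    dst=$2
    gzip -dc "$src" | gzip -c > "$dst"
}

# Demo on a small two-member file like the one in the repro.
workdir=$(mktemp -d)
printf 'one two three\n' | gzip -c > "$workdir/combined_file.gz"
printf 'four five six\n' | gzip -c >> "$workdir/combined_file.gz"
regzip_single_member "$workdir/combined_file.gz" "$workdir/single_member.gz"

# Check that the decompressed content is identical before and after.
before=$(gzip -dc "$workdir/combined_file.gz" | cksum)
after=$(gzip -dc "$workdir/single_member.gz" | cksum)
echo "before: $before"
echo "after:  $after"
```

Comparing checksums of the decompressed streams confirms nothing was lost in the rewrite before the original file is replaced.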

It's especially dangerous since Hadoop doesn't show any errors or complain in the least. It just ignores the extra input. The only way of noticing is to run one's app with gzipped and unzipped data side by side and notice that the record counts differ.

Is anyone else familiar with this problem? Any solutions, workarounds, short of re-gzipping very large amounts of data?

Thanks!
/ Oscar

________________________________
The information transmitted in this email is intended only for the person(s) or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this email in error, please contact the sender and permanently delete the email from any computer.

RE: Problem with Hadoop and concatenated gzip files

Posted by Oscar Gothberg <Os...@platform-a.com>.
Thanks Tom,

yes, assuming I got native libraries correctly enabled... I get:

09/01/12 11:33:19 INFO util.NativeCodeLoader: Loaded the native-hadoop library
09/01/12 11:33:19 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

...at startup. Then I tried without them, by doing another run with the 'native' directory moved out of 'lib', and got this:

09/01/12 11:33:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

...either way it only reads half the file in the example from my original message.

/ Oscar

-----Original Message-----
From: Tom White [mailto:tom@cloudera.com]
Sent: Monday, January 12, 2009 5:20 AM
To: core-user@hadoop.apache.org
Subject: Re: Problem with Hadoop and concatenated gzip files

I've opened https://issues.apache.org/jira/browse/HADOOP-5014 for this.

Do you get this behaviour when you use the native libraries?

Tom

Re: Problem with Hadoop and concatenated gzip files

Posted by Tom White <to...@cloudera.com>.
I've opened https://issues.apache.org/jira/browse/HADOOP-5014 for this.

Do you get this behaviour when you use the native libraries?

Tom
