Posted to user@hadoop.apache.org by William Oberman <ob...@civicscience.com> on 2013/06/10 19:23:21 UTC

gz containing null chars?

I posted this to the Pig mailing list, but it might be more related to
Hadoop itself; I'm not sure.

Quick recap: I had a file of "\n"-separated lines of JSON. I decided to
compress it to save on storage costs. After compression I got a different
answer for a Pig query that is basically "count lines".

After a lot of digging, I found an input file containing a line that is a
huge block of null characters followed by a "\n". I wrote scripts to
examine the file directly, and if I stop counting at that weird line, I get
the same count that Pig claims for the compressed file. If I count all lines
(i.e., don't stop at the corrupt line), I get the "uncompressed" count that
Pig claims.
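
The examination script was roughly like this (a simplified Python sketch;
the real scripts differed, and the file name is made up):

count = 0
count_before_null = None
with open("input.json", "rb") as f:
    for line in f:
        count += 1
        # a "corrupt" line: non-empty, but nothing except null bytes
        if line.strip(b"\n") and not line.strip(b"\x00\n"):
            if count_before_null is None:
                count_before_null = count - 1
print("all lines:", count)
print("lines before first all-null line:", count_before_null)

The first number matches the "uncompressed" count; the second matches what
Pig claims for the compressed file.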

I don't yet know how to debug Hadoop/Pig at this level, though I'm trying
now. My working theory is that some combination of Pig/Hadoop aborts
processing the gz stream at a null character (or something like that), but
keeps chugging on a non-gz stream. Does that sound familiar or make sense
to anyone?

will

Re: gz containing null chars?

Posted by Niels Basjes <Ni...@basjes.nl>.
My best guess is that, at a low level, strings are often terminated by a
null byte at the end; perhaps that's where the difference lies.
Perhaps the gz decompressor simply stops at the null byte, while the basic
record reader on the uncompressed path simply continues.
In that case your input file contains bytes that should not occur in an
ASCII file (like the JSON file you have), so you can expect the
unexpected ;)
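
One quick way to check that theory locally (a sketch using Python's gzip
module, which is not the codec Hadoop uses, so treat it as illustrative;
the file names are made up):

import gzip

# Count lines in the raw file and in a gzipped copy of it; if the
# decompressor stopped at a null byte, the two counts would differ.
plain_count = sum(1 for _ in open("input.json", "rb"))
gz_count = sum(1 for _ in gzip.open("input.json.gz", "rb"))
print("plain:", plain_count, "gz:", gz_count)

If the counts match here, the stop-at-null behaviour would have to be
specific to the Hadoop/Pig decompression path rather than to gzip in
general.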

Niels
On Jun 10, 2013 7:24 PM, "William Oberman" <ob...@civicscience.com> wrote:

> I posted this to the Pig mailing list, but it might be more related to
> Hadoop itself; I'm not sure.
>
> Quick recap: I had a file of "\n"-separated lines of JSON. I decided to
> compress it to save on storage costs. After compression I got a different
> answer for a Pig query that is basically "count lines".
>
> After a lot of digging, I found an input file containing a line that is a
> huge block of null characters followed by a "\n". I wrote scripts to
> examine the file directly, and if I stop counting at that weird line, I get
> the same count that Pig claims for the compressed file. If I count all lines
> (i.e., don't stop at the corrupt line), I get the "uncompressed" count that
> Pig claims.
>
> I don't yet know how to debug Hadoop/Pig at this level, though I'm trying
> now. My working theory is that some combination of Pig/Hadoop aborts
> processing the gz stream at a null character (or something like that), but
> keeps chugging on a non-gz stream. Does that sound familiar or make sense
> to anyone?
>
> will
>
