Posted to user@pig.apache.org by Vadim Zaliva <kr...@gmail.com> on 2009/03/14 10:00:08 UTC

bzip/gzip

I am considering starting to use compression for the data files I process
with Pig. I am using the trunk version of Pig on Hadoop-0.18.3. Uncompressed
files are about 500MB each, and I plan to have a total of a few dozen
terabytes of uncompressed data. The DFS block size I am using is 96MB.

I am looking for feedback on the idea of using {B|G}ZIP-compressed files.
Can Pig handle them? How would it affect splitting? I have read somewhere
that bzip2 files can be split, whereas gzip files cannot. Could somebody
confirm this?

Thanks!

Vadim

Re: bzip/gzip

Posted by Tamir Kamara <ta...@gmail.com>.
Hi,

What do you mean by IIRC?

I tried this again just now, and it seems that Hadoop does compress the job
output (I guess I did something wrong before). I can tell because the DFS web
interface displays the part files as if they were compressed.
By the way, how do I get the compressed output out of the DFS and decompress
it?

Thanks,
Tamir
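
A minimal sketch of one way to do that, assuming gzip part files under a
hypothetical /user/hadoop/out directory: Grunt's copyToLocal (or the
equivalent hadoop fs -copyToLocal) pulls the part files to the local
filesystem, where the usual command-line tools can decompress them.

grunt> copyToLocal /user/hadoop/out/part-00000.gz /tmp/part-00000.gz
$ gunzip /tmp/part-00000.gz

For LZO output the right local tool depends on which codec wrote the file,
so that case is not covered here.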


On Mon, Mar 16, 2009 at 9:23 PM, Mridul Muralidharan
<mr...@yahoo-inc.com> wrote:

> Tamir Kamara wrote:
>
>> Hi,
>>
>> I did some testing with both gzip and bzip2.
>> As Alan wrote, bz has the advantage of being splittable out of the box but
>> the disadvantage is its performance both in compression and decompression
>> -
>> bz is slow I don't think the smaller file is worth it.
>> I also got wrong results when using bz files with the latest trunk which
>> suggests that there're still some problems. I've emailed the details of
>> the
>> problem here a week ago.
>> For now, when I need to I split the files manually and use gzip before
>> moving them into the dfs within a specific directory and then load that
>> entire directory with pig.
>> I also tried to use lzo but had some problems with it. What I did see is
>> that lzo is faster than gzip but produces larger files.
>> As I understand the situation, pig can only write to bz files but read
>> also
>> gz, lzo and zlib (handled by hadoop).
>> I originally wanted pig to write normal text files and have hadoop
>> compress
>> the output to the other compression types (e.g. lzo), and I configured
>> hadoop as mentioned in the docs but still got an uncompressed output. If
>> anyone knows how to use this feature, please write.
>>
>>
>
> Can you elaborate on what you mean by uncompressed output ?
>
> IIRC hadoop does block compression - so a fetch would give you uncompressed
> file, and a put compresses it on the fly - the actual data is stored in
> compressed form on the hdfs blocks.
> So disc 'space' goes up - with a (de)compression time tradeoff. All
> manipulation of the file would give you an uncompressed view of the file
> iirc.
>
> Please do let me know in case my understanding is wrong !
>
> Thanks,
> Mridul
>
>
>  Tamir
>>
>>
>> On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>>
>>  I haven't worked extensively with compressed data, so I'll let others who
>>> have share their experience.  But pig does work with bzip data, which can
>>> be
>>> split.  PigStorage checks to see if the input or output file ends in .bz,
>>> and if so uses bzip to read/write the data.  There have been some bugs in
>>> this code, so you should make sure you have the top of trunk version as
>>> it's
>>> been fixed fairly recently.
>>>
>>> gzip files cannot be split, and if you gzip your whole file, you can't
>>> really use it with map/reduce or pig.  But, hadoop now supports
>>> compressing
>>> each block.  As I understand it lzo is preferred over gzip for this.  But
>>> when you use this, it works fine with pig because hadoop handles the
>>> (de)compression underneath pig.  You should be able to find info on how
>>> to
>>> do this on your cluster in the hadoop docs.
>>>
>>> Alan.
>>>
>>>
>>> On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
>>>
>>> I am considering starting to use compression for data files I process
>>>
>>>> with PIG. I am using trunk version of PIG
>>>> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
>>>> to have total few dozen terabytes of uncompressed data.
>>>> DFS block size I am using is 96Mb.
>>>>
>>>> I am looking for a feedback on idea of using {B|G}ZIP compressed
>>>> files. Could PIG handle them? How would it affect splittng?
>>>> I have read somewhere that bzip files could be split, whereas gzip
>>>> could not. Could somebody confirm this?
>>>>
>>>> Thanks!
>>>>
>>>> Vadim
>>>>
>>>>
>>>
>>
>

Re: bzip/gzip

Posted by Tamir Kamara <ta...@gmail.com>.
Hi,

To be more specific, I tested the LZO output compression with both Pig and
Hadoop. An MR job written without Pig produces part files with the .lzo
suffix, which means that Hadoop did compress the output of the job. However,
the output of a similar job written in Pig, using the same compression-related
configuration (stated in the hadoop-site.xml I attached), is not compressed.
I also attached the job XML for the Pig job, which shows that Pig gets all
the relevant compression properties that should produce LZO output.

Am I doing something wrong?
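
For reference, the job-output compression settings under discussion are the
old-API Hadoop properties, normally set in hadoop-site.xml (or per job). A
sketch follows - the exact LZO codec class name is an assumption and depends
on the Hadoop version and how LZO is installed on the cluster:

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>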


On Mon, Mar 16, 2009 at 9:23 PM, Mridul Muralidharan
<mr...@yahoo-inc.com> wrote:

> Tamir Kamara wrote:
>
>> Hi,
>>
>> I did some testing with both gzip and bzip2.
>> As Alan wrote, bz has the advantage of being splittable out of the box but
>> the disadvantage is its performance both in compression and decompression
>> -
>> bz is slow I don't think the smaller file is worth it.
>> I also got wrong results when using bz files with the latest trunk which
>> suggests that there're still some problems. I've emailed the details of
>> the
>> problem here a week ago.
>> For now, when I need to I split the files manually and use gzip before
>> moving them into the dfs within a specific directory and then load that
>> entire directory with pig.
>> I also tried to use lzo but had some problems with it. What I did see is
>> that lzo is faster than gzip but produces larger files.
>> As I understand the situation, pig can only write to bz files but read
>> also
>> gz, lzo and zlib (handled by hadoop).
>> I originally wanted pig to write normal text files and have hadoop
>> compress
>> the output to the other compression types (e.g. lzo), and I configured
>> hadoop as mentioned in the docs but still got an uncompressed output. If
>> anyone knows how to use this feature, please write.
>>
>>
>
> Can you elaborate on what you mean by uncompressed output ?
>
> IIRC hadoop does block compression - so a fetch would give you uncompressed
> file, and a put compresses it on the fly - the actual data is stored in
> compressed form on the hdfs blocks.
> So disc 'space' goes up - with a (de)compression time tradeoff. All
> manipulation of the file would give you an uncompressed view of the file
> iirc.
>
> Please do let me know in case my understanding is wrong !
>
> Thanks,
> Mridul
>
>
>  Tamir
>>
>>
>> On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>>
>>  I haven't worked extensively with compressed data, so I'll let others who
>>> have share their experience.  But pig does work with bzip data, which can
>>> be
>>> split.  PigStorage checks to see if the input or output file ends in .bz,
>>> and if so uses bzip to read/write the data.  There have been some bugs in
>>> this code, so you should make sure you have the top of trunk version as
>>> it's
>>> been fixed fairly recently.
>>>
>>> gzip files cannot be split, and if you gzip your whole file, you can't
>>> really use it with map/reduce or pig.  But, hadoop now supports
>>> compressing
>>> each block.  As I understand it lzo is preferred over gzip for this.  But
>>> when you use this, it works fine with pig because hadoop handles the
>>> (de)compression underneath pig.  You should be able to find info on how
>>> to
>>> do this on your cluster in the hadoop docs.
>>>
>>> Alan.
>>>
>>>
>>> On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
>>>
>>> I am considering starting to use compression for data files I process
>>>
>>>> with PIG. I am using trunk version of PIG
>>>> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
>>>> to have total few dozen terabytes of uncompressed data.
>>>> DFS block size I am using is 96Mb.
>>>>
>>>> I am looking for a feedback on idea of using {B|G}ZIP compressed
>>>> files. Could PIG handle them? How would it affect splittng?
>>>> I have read somewhere that bzip files could be split, whereas gzip
>>>> could not. Could somebody confirm this?
>>>>
>>>> Thanks!
>>>>
>>>> Vadim
>>>>
>>>>
>>>
>>
>

Re: bzip/gzip

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Tamir Kamara wrote:
> Hi,
> 
> I did some testing with both gzip and bzip2.
> As Alan wrote, bz has the advantage of being splittable out of the box but
> the disadvantage is its performance both in compression and decompression -
> bz is slow I don't think the smaller file is worth it.
> I also got wrong results when using bz files with the latest trunk which
> suggests that there're still some problems. I've emailed the details of the
> problem here a week ago.
> For now, when I need to I split the files manually and use gzip before
> moving them into the dfs within a specific directory and then load that
> entire directory with pig.
> I also tried to use lzo but had some problems with it. What I did see is
> that lzo is faster than gzip but produces larger files.
> As I understand the situation, pig can only write to bz files but read also
> gz, lzo and zlib (handled by hadoop).
> I originally wanted pig to write normal text files and have hadoop compress
> the output to the other compression types (e.g. lzo), and I configured
> hadoop as mentioned in the docs but still got an uncompressed output. If
> anyone knows how to use this feature, please write.
> 


Can you elaborate on what you mean by uncompressed output?

IIRC Hadoop does block compression - so a fetch would give you the
uncompressed file, and a put compresses it on the fly; the actual data
is stored in compressed form in the HDFS blocks.
So disk 'space' goes up, with a (de)compression time tradeoff. All
manipulation of the file would give you an uncompressed view of it, IIRC.

Please do let me know in case my understanding is wrong!

Thanks,
Mridul

> Tamir
> 
> 
> On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
> 
>> I haven't worked extensively with compressed data, so I'll let others who
>> have share their experience.  But pig does work with bzip data, which can be
>> split.  PigStorage checks to see if the input or output file ends in .bz,
>> and if so uses bzip to read/write the data.  There have been some bugs in
>> this code, so you should make sure you have the top of trunk version as it's
>> been fixed fairly recently.
>>
>> gzip files cannot be split, and if you gzip your whole file, you can't
>> really use it with map/reduce or pig.  But, hadoop now supports compressing
>> each block.  As I understand it lzo is preferred over gzip for this.  But
>> when you use this, it works fine with pig because hadoop handles the
>> (de)compression underneath pig.  You should be able to find info on how to
>> do this on your cluster in the hadoop docs.
>>
>> Alan.
>>
>>
>> On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
>>
>> I am considering starting to use compression for data files I process
>>> with PIG. I am using trunk version of PIG
>>> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
>>> to have total few dozen terabytes of uncompressed data.
>>> DFS block size I am using is 96Mb.
>>>
>>> I am looking for a feedback on idea of using {B|G}ZIP compressed
>>> files. Could PIG handle them? How would it affect splittng?
>>> I have read somewhere that bzip files could be split, whereas gzip
>>> could not. Could somebody confirm this?
>>>
>>> Thanks!
>>>
>>> Vadim
>>>
>>
> 


Re: bzip/gzip

Posted by Vadim Zaliva <kr...@gmail.com>.
I am experimenting with bzip2-compressed data, and now my tasks get killed
after getting a bunch of these:

Task attempt_200903131720_0047_m_000270_0 failed to report status for
627 seconds. Killing!

I suspect it is bzip-related, but I am not 100% sure yet.

Vadim
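
As an aside, the roughly 600-second threshold in that message appears to be
the default mapred.task.timeout (10 minutes). If the bzip2 splits are
genuinely slow rather than hung, one workaround - a sketch, value purely
illustrative - is to raise the timeout in hadoop-site.xml:

<property>
  <name>mapred.task.timeout</name>
  <!-- milliseconds; the default is 600000 (10 minutes) -->
  <value>1800000</value>
</property>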

On Tue, Mar 17, 2009 at 08:50, Tamir Kamara <ta...@gmail.com> wrote:
> I have the data and I just verified that the problem described is still
> happening. Do you want me to try something else on it ?
>
>
> On Tue, Mar 17, 2009 at 4:33 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:
>
>> is there a way to reproduce the dataset?
>> thanx
>> ben
>>
>> -----Original Message-----
>> From: Tamir Kamara [mailto:tamirkamara@gmail.com]
>> Sent: Tuesday, March 17, 2009 6:19 AM
>> To: pig-user@hadoop.apache.org
>> Subject: Re: bzip/gzip
>>
>> Sure. My query is simple enough:
>>
>> --links = LOAD '/user/hadoop/links/links.txt.bz2' AS (target:int,
>> source:int);
>> links = LOAD '/user/hadoop/links/links-gz/*' AS (target:int, source:int);
>> a = filter links by target==98;
>> a1 = foreach a generate source;
>> b = JOIN links by source, a1 by source USING "replicated";
>> c = group b by links::source;
>> d = foreach c generate group as source, COUNT(*);
>> dump d;
>>
>> I used the same source file to create both the bz file and the splitted gz
>> files. The right results were produced with the gz files and bz results
>> were
>> off by 1 or 2 for all records.
>>
>> Thanks,
>> Tamir
>>
>>
>> On Tue, Mar 17, 2009 at 3:08 PM, Benjamin Reed <br...@yahoo-inc.com>
>> wrote:
>>
>> > can you give more information on the wrong results you are getting? it
>> > would be great if we could reproduce the problem.
>> >
>> > ben
>> >
>> > -----Original Message-----
>> > From: Tamir Kamara [mailto:tamirkamara@gmail.com]
>> > Sent: Monday, March 16, 2009 11:10 AM
>> > To: pig-user@hadoop.apache.org
>> > Subject: Re: bzip/gzip
>> >
>> > Hi,
>> >
>> > I did some testing with both gzip and bzip2.
>> > As Alan wrote, bz has the advantage of being splittable out of the box
>> but
>> > the disadvantage is its performance both in compression and decompression
>> -
>> > bz is slow I don't think the smaller file is worth it.
>> > I also got wrong results when using bz files with the latest trunk which
>> > suggests that there're still some problems. I've emailed the details of
>> the
>> > problem here a week ago.
>> > For now, when I need to I split the files manually and use gzip before
>> > moving them into the dfs within a specific directory and then load that
>> > entire directory with pig.
>> > I also tried to use lzo but had some problems with it. What I did see is
>> > that lzo is faster than gzip but produces larger files.
>> > As I understand the situation, pig can only write to bz files but read
>> also
>> > gz, lzo and zlib (handled by hadoop).
>> > I originally wanted pig to write normal text files and have hadoop
>> compress
>> > the output to the other compression types (e.g. lzo), and I configured
>> > hadoop as mentioned in the docs but still got an uncompressed output. If
>> > anyone knows how to use this feature, please write.
>> >
>> > Tamir
>> >
>> >
>> > On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>> >
>> > > I haven't worked extensively with compressed data, so I'll let others
>> who
>> > > have share their experience.  But pig does work with bzip data, which
>> can
>> > be
>> > > split.  PigStorage checks to see if the input or output file ends in
>> .bz,
>> > > and if so uses bzip to read/write the data.  There have been some bugs
>> in
>> > > this code, so you should make sure you have the top of trunk version as
>> > it's
>> > > been fixed fairly recently.
>> > >
>> > > gzip files cannot be split, and if you gzip your whole file, you can't
>> > > really use it with map/reduce or pig.  But, hadoop now supports
>> > compressing
>> > > each block.  As I understand it lzo is preferred over gzip for this.
>>  But
>> > > when you use this, it works fine with pig because hadoop handles the
>> > > (de)compression underneath pig.  You should be able to find info on how
>> > to
>> > > do this on your cluster in the hadoop docs.
>> > >
>> > > Alan.
>> > >
>> > >
>> > > On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
>> > >
>> > > I am considering starting to use compression for data files I process
>> > >> with PIG. I am using trunk version of PIG
>> > >> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
>> > >> to have total few dozen terabytes of uncompressed data.
>> > >> DFS block size I am using is 96Mb.
>> > >>
>> > >> I am looking for a feedback on idea of using {B|G}ZIP compressed
>> > >> files. Could PIG handle them? How would it affect splittng?
>> > >> I have read somewhere that bzip files could be split, whereas gzip
>> > >> could not. Could somebody confirm this?
>> > >>
>> > >> Thanks!
>> > >>
>> > >> Vadim
>> > >>
>> > >
>> > >
>> >
>>
>

Re: bzip/gzip

Posted by Tamir Kamara <ta...@gmail.com>.
I have the data and I just verified that the problem described is still
happening. Do you want me to try something else on it?


On Tue, Mar 17, 2009 at 4:33 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:

> is there a way to reproduce the dataset?
> thanx
> ben
>
> -----Original Message-----
> From: Tamir Kamara [mailto:tamirkamara@gmail.com]
> Sent: Tuesday, March 17, 2009 6:19 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: bzip/gzip
>
> Sure. My query is simple enough:
>
> --links = LOAD '/user/hadoop/links/links.txt.bz2' AS (target:int,
> source:int);
> links = LOAD '/user/hadoop/links/links-gz/*' AS (target:int, source:int);
> a = filter links by target==98;
> a1 = foreach a generate source;
> b = JOIN links by source, a1 by source USING "replicated";
> c = group b by links::source;
> d = foreach c generate group as source, COUNT(*);
> dump d;
>
> I used the same source file to create both the bz file and the splitted gz
> files. The right results were produced with the gz files and bz results
> were
> off by 1 or 2 for all records.
>
> Thanks,
> Tamir
>
>
> On Tue, Mar 17, 2009 at 3:08 PM, Benjamin Reed <br...@yahoo-inc.com>
> wrote:
>
> > can you give more information on the wrong results you are getting? it
> > would be great if we could reproduce the problem.
> >
> > ben
> >
> > -----Original Message-----
> > From: Tamir Kamara [mailto:tamirkamara@gmail.com]
> > Sent: Monday, March 16, 2009 11:10 AM
> > To: pig-user@hadoop.apache.org
> > Subject: Re: bzip/gzip
> >
> > Hi,
> >
> > I did some testing with both gzip and bzip2.
> > As Alan wrote, bz has the advantage of being splittable out of the box
> but
> > the disadvantage is its performance both in compression and decompression
> -
> > bz is slow I don't think the smaller file is worth it.
> > I also got wrong results when using bz files with the latest trunk which
> > suggests that there're still some problems. I've emailed the details of
> the
> > problem here a week ago.
> > For now, when I need to I split the files manually and use gzip before
> > moving them into the dfs within a specific directory and then load that
> > entire directory with pig.
> > I also tried to use lzo but had some problems with it. What I did see is
> > that lzo is faster than gzip but produces larger files.
> > As I understand the situation, pig can only write to bz files but read
> also
> > gz, lzo and zlib (handled by hadoop).
> > I originally wanted pig to write normal text files and have hadoop
> compress
> > the output to the other compression types (e.g. lzo), and I configured
> > hadoop as mentioned in the docs but still got an uncompressed output. If
> > anyone knows how to use this feature, please write.
> >
> > Tamir
> >
> >
> > On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
> >
> > > I haven't worked extensively with compressed data, so I'll let others
> who
> > > have share their experience.  But pig does work with bzip data, which
> can
> > be
> > > split.  PigStorage checks to see if the input or output file ends in
> .bz,
> > > and if so uses bzip to read/write the data.  There have been some bugs
> in
> > > this code, so you should make sure you have the top of trunk version as
> > it's
> > > been fixed fairly recently.
> > >
> > > gzip files cannot be split, and if you gzip your whole file, you can't
> > > really use it with map/reduce or pig.  But, hadoop now supports
> > compressing
> > > each block.  As I understand it lzo is preferred over gzip for this.
>  But
> > > when you use this, it works fine with pig because hadoop handles the
> > > (de)compression underneath pig.  You should be able to find info on how
> > to
> > > do this on your cluster in the hadoop docs.
> > >
> > > Alan.
> > >
> > >
> > > On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
> > >
> > > I am considering starting to use compression for data files I process
> > >> with PIG. I am using trunk version of PIG
> > >> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
> > >> to have total few dozen terabytes of uncompressed data.
> > >> DFS block size I am using is 96Mb.
> > >>
> > >> I am looking for a feedback on idea of using {B|G}ZIP compressed
> > >> files. Could PIG handle them? How would it affect splittng?
> > >> I have read somewhere that bzip files could be split, whereas gzip
> > >> could not. Could somebody confirm this?
> > >>
> > >> Thanks!
> > >>
> > >> Vadim
> > >>
> > >
> > >
> >
>

RE: bzip/gzip

Posted by Benjamin Reed <br...@yahoo-inc.com>.
is there a way to reproduce the dataset?
thanx
ben

-----Original Message-----
From: Tamir Kamara [mailto:tamirkamara@gmail.com] 
Sent: Tuesday, March 17, 2009 6:19 AM
To: pig-user@hadoop.apache.org
Subject: Re: bzip/gzip

Sure. My query is simple enough:

--links = LOAD '/user/hadoop/links/links.txt.bz2' AS (target:int,
source:int);
links = LOAD '/user/hadoop/links/links-gz/*' AS (target:int, source:int);
a = filter links by target==98;
a1 = foreach a generate source;
b = JOIN links by source, a1 by source USING "replicated";
c = group b by links::source;
d = foreach c generate group as source, COUNT(*);
dump d;

I used the same source file to create both the bz file and the splitted gz
files. The right results were produced with the gz files and bz results were
off by 1 or 2 for all records.

Thanks,
Tamir


On Tue, Mar 17, 2009 at 3:08 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:

> can you give more information on the wrong results you are getting? it
> would be great if we could reproduce the problem.
>
> ben
>
> -----Original Message-----
> From: Tamir Kamara [mailto:tamirkamara@gmail.com]
> Sent: Monday, March 16, 2009 11:10 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: bzip/gzip
>
> Hi,
>
> I did some testing with both gzip and bzip2.
> As Alan wrote, bz has the advantage of being splittable out of the box but
> the disadvantage is its performance both in compression and decompression -
> bz is slow I don't think the smaller file is worth it.
> I also got wrong results when using bz files with the latest trunk which
> suggests that there're still some problems. I've emailed the details of the
> problem here a week ago.
> For now, when I need to I split the files manually and use gzip before
> moving them into the dfs within a specific directory and then load that
> entire directory with pig.
> I also tried to use lzo but had some problems with it. What I did see is
> that lzo is faster than gzip but produces larger files.
> As I understand the situation, pig can only write to bz files but read also
> gz, lzo and zlib (handled by hadoop).
> I originally wanted pig to write normal text files and have hadoop compress
> the output to the other compression types (e.g. lzo), and I configured
> hadoop as mentioned in the docs but still got an uncompressed output. If
> anyone knows how to use this feature, please write.
>
> Tamir
>
>
> On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> > I haven't worked extensively with compressed data, so I'll let others who
> > have share their experience.  But pig does work with bzip data, which can
> be
> > split.  PigStorage checks to see if the input or output file ends in .bz,
> > and if so uses bzip to read/write the data.  There have been some bugs in
> > this code, so you should make sure you have the top of trunk version as
> it's
> > been fixed fairly recently.
> >
> > gzip files cannot be split, and if you gzip your whole file, you can't
> > really use it with map/reduce or pig.  But, hadoop now supports
> compressing
> > each block.  As I understand it lzo is preferred over gzip for this.  But
> > when you use this, it works fine with pig because hadoop handles the
> > (de)compression underneath pig.  You should be able to find info on how
> to
> > do this on your cluster in the hadoop docs.
> >
> > Alan.
> >
> >
> > On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
> >
> > I am considering starting to use compression for data files I process
> >> with PIG. I am using trunk version of PIG
> >> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
> >> to have total few dozen terabytes of uncompressed data.
> >> DFS block size I am using is 96Mb.
> >>
> >> I am looking for a feedback on idea of using {B|G}ZIP compressed
> >> files. Could PIG handle them? How would it affect splittng?
> >> I have read somewhere that bzip files could be split, whereas gzip
> >> could not. Could somebody confirm this?
> >>
> >> Thanks!
> >>
> >> Vadim
> >>
> >
> >
>

Re: bzip/gzip

Posted by Tamir Kamara <ta...@gmail.com>.
Sure. My query is simple enough:

--links = LOAD '/user/hadoop/links/links.txt.bz2' AS (target:int,
source:int);
links = LOAD '/user/hadoop/links/links-gz/*' AS (target:int, source:int);
a = filter links by target==98;
a1 = foreach a generate source;
b = JOIN links by source, a1 by source USING "replicated";
c = group b by links::source;
d = foreach c generate group as source, COUNT(*);
dump d;

I used the same source file to create both the bz file and the split gz
files. The right results were produced with the gz files, and the bz results
were off by 1 or 2 for all records.

Thanks,
Tamir


On Tue, Mar 17, 2009 at 3:08 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:

> can you give more information on the wrong results you are getting? it
> would be great if we could reproduce the problem.
>
> ben
>
> -----Original Message-----
> From: Tamir Kamara [mailto:tamirkamara@gmail.com]
> Sent: Monday, March 16, 2009 11:10 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: bzip/gzip
>
> Hi,
>
> I did some testing with both gzip and bzip2.
> As Alan wrote, bz has the advantage of being splittable out of the box but
> the disadvantage is its performance both in compression and decompression -
> bz is slow I don't think the smaller file is worth it.
> I also got wrong results when using bz files with the latest trunk which
> suggests that there're still some problems. I've emailed the details of the
> problem here a week ago.
> For now, when I need to I split the files manually and use gzip before
> moving them into the dfs within a specific directory and then load that
> entire directory with pig.
> I also tried to use lzo but had some problems with it. What I did see is
> that lzo is faster than gzip but produces larger files.
> As I understand the situation, pig can only write to bz files but read also
> gz, lzo and zlib (handled by hadoop).
> I originally wanted pig to write normal text files and have hadoop compress
> the output to the other compression types (e.g. lzo), and I configured
> hadoop as mentioned in the docs but still got an uncompressed output. If
> anyone knows how to use this feature, please write.
>
> Tamir
>
>
> On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> > I haven't worked extensively with compressed data, so I'll let others who
> > have share their experience.  But pig does work with bzip data, which can
> be
> > split.  PigStorage checks to see if the input or output file ends in .bz,
> > and if so uses bzip to read/write the data.  There have been some bugs in
> > this code, so you should make sure you have the top of trunk version as
> it's
> > been fixed fairly recently.
> >
> > gzip files cannot be split, and if you gzip your whole file, you can't
> > really use it with map/reduce or pig.  But, hadoop now supports
> compressing
> > each block.  As I understand it lzo is preferred over gzip for this.  But
> > when you use this, it works fine with pig because hadoop handles the
> > (de)compression underneath pig.  You should be able to find info on how
> to
> > do this on your cluster in the hadoop docs.
> >
> > Alan.
> >
> >
> > On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
> >
> > I am considering starting to use compression for data files I process
> >> with PIG. I am using trunk version of PIG
> >> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
> >> to have total few dozen terabytes of uncompressed data.
> >> DFS block size I am using is 96Mb.
> >>
> >> I am looking for a feedback on idea of using {B|G}ZIP compressed
> >> files. Could PIG handle them? How would it affect splittng?
> >> I have read somewhere that bzip files could be split, whereas gzip
> >> could not. Could somebody confirm this?
> >>
> >> Thanks!
> >>
> >> Vadim
> >>
> >
> >
>

RE: bzip/gzip

Posted by Benjamin Reed <br...@yahoo-inc.com>.
can you give more information on the wrong results you are getting? it would be great if we could reproduce the problem.

ben

-----Original Message-----
From: Tamir Kamara [mailto:tamirkamara@gmail.com] 
Sent: Monday, March 16, 2009 11:10 AM
To: pig-user@hadoop.apache.org
Subject: Re: bzip/gzip

Hi,

I did some testing with both gzip and bzip2.
As Alan wrote, bz has the advantage of being splittable out of the box but
the disadvantage is its performance both in compression and decompression -
bz is slow I don't think the smaller file is worth it.
I also got wrong results when using bz files with the latest trunk which
suggests that there're still some problems. I've emailed the details of the
problem here a week ago.
For now, when I need to I split the files manually and use gzip before
moving them into the dfs within a specific directory and then load that
entire directory with pig.
I also tried to use lzo but had some problems with it. What I did see is
that lzo is faster than gzip but produces larger files.
As I understand the situation, pig can only write to bz files but read also
gz, lzo and zlib (handled by hadoop).
I originally wanted pig to write normal text files and have hadoop compress
the output to the other compression types (e.g. lzo), and I configured
hadoop as mentioned in the docs but still got an uncompressed output. If
anyone knows how to use this feature, please write.

Tamir


On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> I haven't worked extensively with compressed data, so I'll let others who
> have share their experience.  But pig does work with bzip data, which can be
> split.  PigStorage checks to see if the input or output file ends in .bz,
> and if so uses bzip to read/write the data.  There have been some bugs in
> this code, so you should make sure you have the top of trunk version as it's
> been fixed fairly recently.
>
> gzip files cannot be split, and if you gzip your whole file, you can't
> really use it with map/reduce or pig.  But, hadoop now supports compressing
> each block.  As I understand it lzo is preferred over gzip for this.  But
> when you use this, it works fine with pig because hadoop handles the
> (de)compression underneath pig.  You should be able to find info on how to
> do this on your cluster in the hadoop docs.
>
> Alan.
>
>
> On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
>
> I am considering starting to use compression for data files I process
>> with PIG. I am using trunk version of PIG
>> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
>> to have total few dozen terabytes of uncompressed data.
>> DFS block size I am using is 96Mb.
>>
>> I am looking for a feedback on idea of using {B|G}ZIP compressed
>> files. Could PIG handle them? How would it affect splittng?
>> I have read somewhere that bzip files could be split, whereas gzip
>> could not. Could somebody confirm this?
>>
>> Thanks!
>>
>> Vadim
>>
>
>

Re: bzip/gzip

Posted by Tamir Kamara <ta...@gmail.com>.
Hi,

I did some testing with both gzip and bzip2.
As Alan wrote, bz has the advantage of being splittable out of the box, but
the disadvantage is its performance in both compression and decompression -
bz is so slow that I don't think the smaller files are worth it.
I also got wrong results when using bz files with the latest trunk, which
suggests that there are still some problems. I emailed the details of the
problem here a week ago.
For now, when I need to, I split the files manually and gzip them before
moving them into the DFS under a specific directory, and then load that
entire directory with Pig.
I also tried to use LZO but had some problems with it. What I did see is
that LZO is faster than gzip but produces larger files.
As I understand the situation, Pig can only write bz files, but it can also
read gz, lzo and zlib (handled by Hadoop).
I originally wanted Pig to write normal text files and have Hadoop compress
the output with one of the other codecs (e.g. lzo), and I configured Hadoop
as described in the docs but still got uncompressed output. If anyone knows
how to use this feature, please write.

Tamir


On Mon, Mar 16, 2009 at 5:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> I haven't worked extensively with compressed data, so I'll let others who
> have share their experience.  But pig does work with bzip data, which can be
> split.  PigStorage checks to see if the input or output file ends in .bz,
> and if so uses bzip to read/write the data.  There have been some bugs in
> this code, so you should make sure you have the top of trunk version as it's
> been fixed fairly recently.
>
> gzip files cannot be split, and if you gzip your whole file, you can't
> really use it with map/reduce or pig.  But, hadoop now supports compressing
> each block.  As I understand it lzo is preferred over gzip for this.  But
> when you use this, it works fine with pig because hadoop handles the
> (de)compression underneath pig.  You should be able to find info on how to
> do this on your cluster in the hadoop docs.
>
> Alan.
>
>
> On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:
>
> I am considering starting to use compression for data files I process
>> with PIG. I am using trunk version of PIG
>> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
>> to have total few dozen terabytes of uncompressed data.
>> DFS block size I am using is 96Mb.
>>
>> I am looking for a feedback on idea of using {B|G}ZIP compressed
>> files. Could PIG handle them? How would it affect splittng?
>> I have read somewhere that bzip files could be split, whereas gzip
>> could not. Could somebody confirm this?
>>
>> Thanks!
>>
>> Vadim
>>
>
>

RE: bzip/gzip

Posted by Benjamin Reed <br...@yahoo-inc.com>.
just to clarify gzip a little bit, gzip files cannot be split, so if you have gzipped input, you will get a map task for each file. it will still work, but if you don't have at least as many files as you have map task slots, you will lose some parallelism.

ben
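
A small illustration of the workaround Tamir describes elsewhere in the
thread (path and file names hypothetical): pre-split the input into many
smaller .gz files and load them with a glob, so each gzip file gets its own
map task and parallelism is preserved.

links = LOAD '/user/hadoop/links/links-gz/part-*.gz' AS (target:int, source:int);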

-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Monday, March 16, 2009 8:10 AM
To: pig-user@hadoop.apache.org
Subject: Re: bzip/gzip

I haven't worked extensively with compressed data, so I'll let others  
who have share their experience.  But pig does work with bzip data,  
which can be split.  PigStorage checks to see if the input or output  
file ends in .bz, and if so uses bzip to read/write the data.  There  
have been some bugs in this code, so you should make sure you have the  
top of trunk version as it's been fixed fairly recently.

gzip files cannot be split, and if you gzip your whole file, you can't  
really use it with map/reduce or pig.  But, hadoop now supports  
compressing each block.  As I understand it lzo is preferred over gzip  
for this.  But when you use this, it works fine with pig because  
hadoop handles the (de)compression underneath pig.  You should be able  
to find info on how to do this on your cluster in the hadoop docs.

Alan.

On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:

> I am considering starting to use compression for data files I process
> with PIG. I am using trunk version of PIG
> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
> to have total few dozen terabytes of uncompressed data.
> DFS block size I am using is 96Mb.
>
> I am looking for a feedback on idea of using {B|G}ZIP compressed
> files. Could PIG handle them? How would it affect splittng?
> I have read somewhere that bzip files could be split, whereas gzip
> could not. Could somebody confirm this?
>
> Thanks!
>
> Vadim


Re: bzip/gzip

Posted by Alan Gates <ga...@yahoo-inc.com>.
I haven't worked extensively with compressed data, so I'll let others  
who have share their experience.  But pig does work with bzip data,  
which can be split.  PigStorage checks to see if the input or output  
file ends in .bz, and if so uses bzip to read/write the data.  There  
have been some bugs in this code, so you should make sure you have the  
top of trunk version as it's been fixed fairly recently.
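
A minimal sketch of what that looks like (paths and schema are hypothetical,
and the suffix PigStorage recognizes - .bz or .bz2 - depends on the Pig
version):

raw = LOAD '/user/hadoop/input/events.txt.bz2' AS (target:int, source:int);
small = FILTER raw BY target == 98;
STORE small INTO '/user/hadoop/output/events-filtered.bz2';

The STORE location is a directory of part files as usual; the bzip suffix on
the name is what tells PigStorage to bzip2-compress the data it writes.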

gzip files cannot be split, and if you gzip your whole file, you can't  
really use it with map/reduce or pig.  But, hadoop now supports  
compressing each block.  As I understand it lzo is preferred over gzip  
for this.  But when you use this, it works fine with pig because  
hadoop handles the (de)compression underneath pig.  You should be able  
to find info on how to do this on your cluster in the hadoop docs.

Alan.

On Mar 14, 2009, at 2:00 AM, Vadim Zaliva wrote:

> I am considering starting to use compression for data files I process
> with PIG. I am using trunk version of PIG
> on Hadoop-0.18.3. Uncompressed files are about 500Mb each, and I plan
> to have total few dozen terabytes of uncompressed data.
> DFS block size I am using is 96Mb.
>
> I am looking for a feedback on idea of using {B|G}ZIP compressed
> files. Could PIG handle them? How would it affect splittng?
> I have read somewhere that bzip files could be split, whereas gzip
> could not. Could somebody confirm this?
>
> Thanks!
>
> Vadim