Posted to user@hive.apache.org by Josh Ferguson <jo...@besquared.net> on 2008/12/02 10:09:21 UTC

Compression

Whatever happened to the compressed storage format? I'd like to keep
delimited files in bz2 if possible to save on space. Is that sort of
thing being considered?

Josh

RE: Compression

Posted by Ashish Thusoo <at...@facebook.com>.
Can't we set up proper codecs for sequence files?

Ashish
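
[For context, "proper codecs for sequence files" would be configured along these lines in a Hive session. This is a hedged sketch: the property names are what one would expect on Hive/Hadoop builds of this era, and should be verified against your own deployment.

```sql
-- Sketch (unverified property names, circa Hadoop/Hive 0.19):
-- emit query results as block-compressed SequenceFiles.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- BLOCK compression keeps SequenceFiles splittable across mappers.
SET io.seqfile.compression.type=BLOCK;
```
]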
________________________________________
From: Josh Ferguson [josh@besquared.net]
Sent: Tuesday, December 02, 2008 1:37 AM
To: hive-user@hadoop.apache.org
Subject: Re: Compression

I'm not sure; from their wiki:

Compressed Input

Compressed files are difficult to process in parallel, since they cannot, in general, be split into fragments and independently decompressed. However, if the compression is block-oriented (e.g. bz2), the splitting and parallel processing is easy to do.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support is coming soon). If the input file name extension is .bz2, Pig decompresses the file on the fly and passes the decompressed input stream to your load function. For example,

A = LOAD 'input.bz2' USING myLoad();


Multiple instances of myLoad() (as dictated by the degree of parallelism) will be created and each will be given a fragment of the *decompressed* version of input.bz2 to process.

On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:

Can you give a little more detail?
For example, you tried a single .bz2 file as input, and the pig job has 2 or more mappers?

I didn't know bz2 was splittable.

Zheng
On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson <jo...@besquared.net> wrote:
It is splittable because of how the compression uses blocks; Pig does this out of the box.

Josh


On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:

It shouldn't be a problem for Hive to support it (by defining your own input/output file format that does the decompression on the fly), but we won't be able to parallelize the execution as we do with uncompressed text files and sequence files, since bz2 compression is not splittable.




--
Yours,
Zheng


RE: Compression

Posted by Ashish Thusoo <at...@facebook.com>.
Josh,

I don't think anything has changed in that code in the last couple of weeks. However, just to be safe, can you start using the svn repo instead? We are going to discontinue the mirror soon. The svn repo compiles with 0.19 (though it does not compile with 0.17 yet).

Ashish

________________________________
From: Josh Ferguson [mailto:josh@besquared.net]
Sent: Tuesday, December 02, 2008 9:04 PM
To: hive-user@hadoop.apache.org
Subject: Re: Compression

I'm using that version along with hive 0.19 that I got a week or two ago from mirror.facebook.com

Josh

On Dec 2, 2008, at 8:59 PM, Zheng Shao wrote:

Yes. As Joydeep mentioned, splitting of bzip2 comes from hadoop 0.19.
What version of hadoop are you using?

Zheng

On Tue, Dec 2, 2008 at 8:45 PM, Josh Ferguson <jo...@besquared.net> wrote:
It does indeed work; sorry for wasting your time. Is mapping these in parallel part of what hadoop offers?

Josh

On Dec 2, 2008, at 8:35 PM, Zheng Shao wrote:

Hi Josh,

Please file a jira. I tried this on our deployment and it worked.

create table tmp_zshao_t4 (a string, b string) stored as textfile;
load data inpath '/user/zshao/patch.txt.bz2' overwrite into table tmp_zshao_t4;

Please let us know the svn revision if you did "svn co" from apache, or the svn.version file in the gz file if you downloaded from mirror.facebook.com.

Zheng
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
Sent: Tuesday, December 02, 2008 7:57 PM
To: hive-user@hadoop.apache.org
Subject: RE: Compression

Please file a Jira in that case. This should work. There is a check to verify that the file type is consistent with the table storage format - but I believe this should only kick in for sequence files.

________________________________
From: Josh Ferguson [mailto:josh@besquared.net]
Sent: Tuesday, December 02, 2008 7:55 PM
To: hive-user@hadoop.apache.org
Subject: Re: Compression

I'm pretty sure that when I tried to load a .bz2 file into a TEXTFILE type table using LOAD LOCAL DATA, it complained at me. I'll have to try it out again, but I'm pretty sure it wasn't working.

Josh

On Dec 2, 2008, at 10:30 AM, Joydeep Sen Sarma wrote:

Yes - from the jiras - bz2 is splittable in hadoop-0.19.

Hive doesn't have to do anything to support this (although we haven't tested it). Please mark your tables as 'stored as textfile' (not sure if that's the default). As long as the file has a .bz2 extension and hadoop has the codec that matches that extension, hive will just rely on hadoop to open these files. We punt to hadoop to decide when/how to split files - so it should 'just work'.

But we never tested it :) - so please let us know if it actually worked out.

I am curious: is there a native codec for bz2 in hadoop? (Java codecs are too slow.)
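
[Joydeep's extension-matching point, sketched as a session: Hadoop resolves a file suffix such as .bz2 against the codec classes registered under io.compression.codecs. The codec list below is an assumed typical 0.19 default, and the table and path are made up.

```sql
-- Codecs Hadoop consults when matching file extensions (assumed list).
SET io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec;

-- With BZip2Codec registered, a plain textfile table can point at .bz2
-- data; hadoop decompresses (and, on 0.19, splits) it transparently.
CREATE TABLE raw_logs (line STRING) STORED AS TEXTFILE;
LOAD DATA INPATH '/user/me/logs.bz2' OVERWRITE INTO TABLE raw_logs;
```
]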

Re: Compression

Posted by Zheng Shao <zs...@gmail.com>.
It shouldn't be a problem for Hive to support it (by defining your own
input/output file format that does the decompression on the fly), but we
won't be able to parallelize the execution as we do with uncompressed text
files and sequence files, since bz2 compression is not splittable.

So a better solution is to store the data in compressed sequence file
format, which saves space, and is also splittable.

Zheng
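
[Zheng's suggested pipeline, as a hedged sketch: land the bz2 text data first, then rewrite it as a compressed SequenceFile table. The table names here are invented.

```sql
-- Stage the delimited data in a plain text table.
CREATE TABLE t_text (a STRING, b STRING) STORED AS TEXTFILE;
LOAD DATA INPATH '/user/zshao/patch.txt.bz2' OVERWRITE INTO TABLE t_text;

-- Rewrite it as a SequenceFile table: smaller than raw text when
-- compression is enabled, and still splittable across mappers.
CREATE TABLE t_seq (a STRING, b STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE t_seq SELECT a, b FROM t_text;
```
]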


-- 
Yours,
Zheng