Posted to user@hive.apache.org by Bill Craig <bc...@gmail.com> on 2009/07/21 17:06:07 UTC

bz2 Splits.

I loaded 5 files of bzip2-compressed data into a table in Hive. Three
are small test files containing 10,000 records. Two were large, ~8GB
compressed.
When I run a query against the table I see three tasks that complete
almost immediately and two tasks that run for a very long time. It
appears to me that Hive/Hadoop is not splitting the input of the *.bz2
files. I have seen some old mails about this, but could not find any
resolution for this problem. I compressed the files using the Apache
bz2 jar, and the files are named *.bz2. I am using Hadoop
0.19.1 r745977.

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
One last question here. If both TextFile and SequenceFile can be
compressed, then what's the advantage of the SequenceFile format?

Is it that a compressed file can be split into chunks only if it is stored
as a SequenceFile?

Saurabh.

On Sat, Jul 25, 2009 at 4:14 PM, Zheng Shao <zs...@gmail.com> wrote:

> Both TextFile and SequenceFile can be compressed or uncompressed.
>
> TextFile means the plain text file (records delimited by "\n").
> Compressed TextFiles are just text files compressed by gzip or bzip2
> utility.
> SequenceFile is a special file format that only Hadoop can understand.
>
> Since your files are compressed TextFiles, you have to create a table
> with TextFile format, in order to load the data without any
> conversion.
> (Compression is detected automatically for both TextFile and
> SequenceFile - you don't need to specify it when creating a table)
>
>
> Does this make the things a bit clearer?
>
> Zheng
>
> On Sat, Jul 25, 2009 at 3:27 AM, Saurabh Nanda<sa...@gmail.com>
> wrote:
> >
> >> If you want to load data (in compressed/uncompressed text format) into
> >> a table, you have to defined the table as "stored as textfile" instead
> >> of "stored as sequencefile".
> >
> > I'm completely confused right now. If sequencefiles are not used for
> > compressed data storage then what are they used for?
> >
> > If I have a gz file, and I want to import it as is (without gunzipping or
> > using an intermediate table), what should I be doing?
> >
> > Saurabh.
> > --
> > http://nandz.blogspot.com
> > http://foodieforlife.blogspot.com
> >
>
>
>
> --
> Yours,
> Zheng
>



-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> Can you help put that information into appropriate place on the wiki
> (where you see fit)?
> Thanks for the help.



http://wiki.apache.org/hadoop/CompressedStorage (please QC and correct where
wrong)
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL?action=diff
http://wiki.apache.org/hadoop/Hive/LanguageManual/DML?action=diff

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
If you follow Approach #3, you should have 8 big compressed
sequencefiles instead of 126 small files.

By the way, you probably didn't set the compression type to BLOCK
compression; otherwise sequencefile compression wouldn't perform like
that.

Try setting this in your hive-site.xml or hadoop-site.xml:

<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
</property>

See http://blog.foofactory.fi/2006/12/my-fellow-nutch-developer-andrzej.html


Zheng


On Sun, Jul 26, 2009 at 10:05 PM, Saurabh Nanda<sa...@gmail.com> wrote:
>
>> Can you help put that information into appropriate place on the wiki
>> (where you see fit)?
>> Thanks for the help.
>
> Will do.
>
>>
>> By the way, I guess we need to debug what went wrong with the
>> "count(1)" queries. There is definitely something going wrong.
>
> My bad here. I think I forgot to import some files when running the queries
> earlier. The counts are exactly the same. However the timings for "select
> count(1)" queries are very different.
>
> #1 Uncompressed logs in textfile tables: 106sec (filesize of 7,686 MB over 8
> uncompressed files)
> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8
> compressed files)
> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB over
> 126 compressed files)
>
>
>>
>> For the timing, how much mapper slots do you have in your cluster?
>
> I have a 4-node cluster with mapred.reduce.tasks=17 Is that what you mean by
> mapper slots?
>
>>
>> Approach #3:
>> a) import gzip files into textfile table
>> b) set hive.exec.compress.output to true
>> c) inserted into sequencefile table
>> This will create bigger sequencefiles which will help reducing the
>> overhead. This is better than Approach #2 because jobs from the
>> sequencefile tables will have more mappers.
>
> This is exactly what I did in #3 above. But, from those benchmarks #2 seems
> to give the best results, both, in terms of file size and speed. Is that not
> what you were expecting?
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
>  Sequence Block compression happens on smaller chunks (around 1MB I think)
> so the compression ration would be smaller than compressing complete file.
>


Is there a configuration parameter which controls this? Is it
io.seqfile.compress.blocksize? It was set to 1,000,000 in
hadoop-default.xml, which is approx 1MB.

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> That you for the wiki page on this. Keep up the good work and please
> post all your findings about compression. Many people (including me)
> will benefit  from an explanation about the different types of
> compression available and the trade offs of different codecs and
> options.



Thanks, Edward. I'm glad that it helped someone.

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Jul 28, 2009 at 11:02 AM, Edward Capriolo<ed...@gmail.com> wrote:
> On Tue, Jul 28, 2009 at 2:22 AM, Zheng Shao<zs...@gmail.com> wrote:
>> Yes we do compress all tables.
>>
>> Zheng
>>
>> On Mon, Jul 27, 2009 at 11:08 PM, Saurabh Nanda<sa...@gmail.com> wrote:
>>>
>>>> In our setup, we didn't change io.seqfile.compress.blocksize (1MB) and
>>>> it's still fairly good.
>>>> You are free to try 100MB for better compression ratio, but I would
>>>> recommend to keep the default setting to minimize the possibilities of
>>>> hitting unknown bugs.
>>>
>>> Makes sense. Better compression brought down a count(1) query from 100+ sec
>>> down to 40sec. The ETL phase is now taking 510sec as opposed to 700sec
>>> earlier.
>>>
>>> Do you also compress all tables, not just the raw ones? Would you recommend
>>> it?
>>>
>>> Saurabh.
>>> --
>>> http://nandz.blogspot.com
>>> http://foodieforlife.blogspot.com
>>>
>>
>>
>>
>> --
>> Yours,
>> Zheng
>>
>
> Saurabh,
>
> That you for the wiki page on this. Keep up the good work and please
> post all your findings about compression. Many people (including me)
> will benefit  from an explanation about the different types of
> compression available and the trade offs of different codecs and
> options. I am really excited as I have (shamefully ) had some large
> tables with multiple text files building up, and the thought of
> smaller data and faster queries is giving me goosebumps.
>
> Edward
>

On a related note...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.IllegalArgumentException: SequenceFile doesn't work with
GzipCodec without native-hadoop code!
:(
I have a Hadoop 0.18.3 (Cloudera) system in production:
hadoop-native-0.18.3-7.cloudera.CH0_3.i386.rpm
Is there any Java-based codec I could use that does not require
external native libraries?
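Would DefaultCodec (the zlib-based codec that also shows up elsewhere in
this thread) work? My understanding is that it can fall back to the JVM's
built-in zlib when native-hadoop is missing, but I haven't verified that on
this build. A sketch, with placeholder table names:

    set hive.exec.compress.output=true;
    set mapred.output.compression.type=BLOCK;
    -- swap GzipCodec for the zlib-based DefaultCodec
    set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
    insert overwrite table logs_seq select line from logs_text;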

Re: Re: bz2 Splits.

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Jul 28, 2009 at 2:22 AM, Zheng Shao<zs...@gmail.com> wrote:
> Yes we do compress all tables.
>
> Zheng
>
> On Mon, Jul 27, 2009 at 11:08 PM, Saurabh Nanda<sa...@gmail.com> wrote:
>>
>>> In our setup, we didn't change io.seqfile.compress.blocksize (1MB) and
>>> it's still fairly good.
>>> You are free to try 100MB for better compression ratio, but I would
>>> recommend to keep the default setting to minimize the possibilities of
>>> hitting unknown bugs.
>>
>> Makes sense. Better compression brought down a count(1) query from 100+ sec
>> down to 40sec. The ETL phase is now taking 510sec as opposed to 700sec
>> earlier.
>>
>> Do you also compress all tables, not just the raw ones? Would you recommend
>> it?
>>
>> Saurabh.
>> --
>> http://nandz.blogspot.com
>> http://foodieforlife.blogspot.com
>>
>
>
>
> --
> Yours,
> Zheng
>

Saurabh,

Thank you for the wiki page on this. Keep up the good work and please
post all your findings about compression. Many people (including me)
will benefit from an explanation about the different types of
compression available and the trade-offs of different codecs and
options. I am really excited, as I have (shamefully) had some large
tables with multiple text files building up, and the thought of
smaller data and faster queries is giving me goosebumps.

Edward

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
Yes, we do compress all tables.

Zheng

On Mon, Jul 27, 2009 at 11:08 PM, Saurabh Nanda<sa...@gmail.com> wrote:
>
>> In our setup, we didn't change io.seqfile.compress.blocksize (1MB) and
>> it's still fairly good.
>> You are free to try 100MB for better compression ratio, but I would
>> recommend to keep the default setting to minimize the possibilities of
>> hitting unknown bugs.
>
> Makes sense. Better compression brought down a count(1) query from 100+ sec
> down to 40sec. The ETL phase is now taking 510sec as opposed to 700sec
> earlier.
>
> Do you also compress all tables, not just the raw ones? Would you recommend
> it?
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> In our setup, we didn't change io.seqfile.compress.blocksize (1MB) and
> it's still fairly good.
> You are free to try 100MB for better compression ratio, but I would
> recommend to keep the default setting to minimize the possibilities of
> hitting unknown bugs.


Makes sense. Better compression brought a count(1) query down from 100+ sec
to 40 sec. The ETL phase is now taking 510 sec as opposed to 700 sec
earlier.

Do you also compress all tables, not just the raw ones? Would you recommend
it?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
In our setup, we didn't change io.seqfile.compress.blocksize (1MB) and
it's still fairly good.
You are free to try 100MB for a better compression ratio, but I would
recommend keeping the default setting to minimize the possibility of
hitting unknown bugs.


Zheng

On Mon, Jul 27, 2009 at 10:38 PM, Saurabh Nanda<sa...@gmail.com> wrote:
>
>> The right configuration parameter is:
>> set mapred.output.compression.type=BLOCK;
>
> I've set mapred.output.compression.type and changed
> io.seqfile.compress.blocksize to 100,000,000 (100MB) and now 3600 MB files
> are down to 260MB!
>
> Is such high compression recommended?
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> The right configuration parameter is:
> set mapred.output.compression.type=BLOCK;


I've set mapred.output.compression.type and changed
io.seqfile.compress.blocksize to 100,000,000 (100MB), and now 3,600 MB files
are down to 260 MB!

Is such high compression recommended?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
Hi Saurabh,

The right configuration parameter is:
set mapred.output.compression.type=BLOCK;

Sorry about pointing you to the wrong configuration parameter.
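
Plugging the corrected parameter into the snippet quoted below, the
relevant lines would look roughly like this (a sketch only; the table,
column, and partition names are the ones from your script):

    set hive.exec.compress.output=true;
    -- mapred.output.compression.type replaces io.seqfile.compression.type
    set mapred.output.compression.type=BLOCK;
    set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    insert overwrite table raw_compressed partition(dt='${D}')
    select line from raw where dt='${D}';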

Zheng

On Mon, Jul 27, 2009 at 10:02 PM, Saurabh Nanda<sa...@gmail.com> wrote:
>
>> The 1600MB number looks like record-level compression. Are you sure
>> you've turned on block compression?
>
> Here's the exact snippet from my shell script. Do I have to set these
> configuration parameters directly in the hadoop configuration file:
>
>     ${HIVE_COMMAND} -e "set hive.exec.compress.output=true; set
> io.seqfile.compression.type=BLOCK; set
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set
> io.seqfile.compress.blocksize=50000000; insert overwrite table
> raw_compressed partition(dt='${D}') select line from raw where dt='${D}'"
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> The 1600MB number looks like record-level compression. Are you sure
> you've turned on block compression?


Here's the exact snippet from my shell script. Do I have to set these
configuration parameters directly in the hadoop configuration file?

    ${HIVE_COMMAND} -e "set hive.exec.compress.output=true;
      set io.seqfile.compression.type=BLOCK;
      set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
      set io.seqfile.compress.blocksize=50000000;
      insert overwrite table raw_compressed partition(dt='${D}')
      select line from raw where dt='${D}'"

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
I cannot imagine there being such a huge compression ratio difference. On
our side, the compression ratios of gzip and GzipCodec (BLOCK) are
within 10% relative difference.
Log file compression ratios are usually 5x to 15x, so 250MB looks like a good one.

The 1600MB number looks like record-level compression. Are you sure
you've turned on block compression?

Zheng

On Mon, Jul 27, 2009 at 8:38 AM, Saurabh Nanda<sa...@gmail.com> wrote:
>
>> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8
>> compressed files)
>> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB
>> over 126 compressed files)
>
> Why is there such a *big* difference in compression ratios between the gzip
> utility and Hive?
>
> Uncompressed file size: approx 3500 MB
> Gzip utility: approx 250 MB
> org.apache.hadoop.io.compress.GzipCodec (BLOCK): approx 1600 MB
> org.apache.hadoop.io.compress.DefaultCodec (BLOCK): approx 1700 MB
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: bz2 Splits.

Posted by Prasad Chakka <pc...@facebook.com>.
SequenceFile block compression happens on smaller chunks (around 1MB, I think), so the compression ratio would be lower than when compressing the complete file.


________________________________
From: Saurabh Nanda <sa...@gmail.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Mon, 27 Jul 2009 08:38:08 -0700
To: <hi...@hadoop.apache.org>
Subject: Re: Re: bz2 Splits.


#2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8 compressed files)
#3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB over 126 compressed files)

Why is there such a *big* difference in compression ratios between the gzip utility and Hive?

Uncompressed file size: approx 3500 MB
Gzip utility: approx 250 MB
org.apache.hadoop.io.compress.GzipCodec (BLOCK): approx 1600 MB
org.apache.hadoop.io.compress.DefaultCodec (BLOCK): approx 1700 MB

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com


Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8
> compressed files)
> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB
> over 126 compressed files)
>

Why is there such a *big* difference in compression ratios between the gzip
utility and Hive?

Uncompressed file size: approx 3500 MB
Gzip utility: approx 250 MB
org.apache.hadoop.io.compress.GzipCodec (BLOCK): approx 1600 MB
org.apache.hadoop.io.compress.DefaultCodec (BLOCK): approx 1700 MB

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> #1 Uncompressed logs in textfile tables: 106sec (filesize of 7,686 MB over
> 8 uncompressed files)
> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8
> compressed files)
> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB
> over 126 compressed files)
>

Some more stats, if anyone's interested. I ran all the three tables
(described above) through my ETL query (as described in
http://nandz.blogspot.com/2009/07/using-hive-for-weblog-analysis.html)

#1: 699sec with 1,561,633 rows in the final table
#2: 563sec with 1,561,633 rows in the final table
#3: 697sec with 1,654,291 rows in the final table (!)

For #3 I've got a different row count. I tried importing the gzipped files &
putting them through ETL again and ended up with 1,743,377 rows the second
time! Will spend some more time to see where I'm going wrong.

However, with these stats it seems that approach #2 gives the best results
with complex queries.

#1 = Uncompressed log files into uncompressed textfile tables
#2 = Inserting #1 with compression on into sequencefile tables
#3 = Compressed log files (gzip) into textfile tables

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> Can you help put that information into appropriate place on the wiki
> (where you see fit)?
> Thanks for the help.


Will do.


> By the way, I guess we need to debug what went wrong with the
> "count(1)" queries. There is definitely something going wrong.


My bad here. I think I forgot to import some files when running the queries
earlier. The counts are exactly the same. However, the timings for "select
count(1)" queries are very different.

#1 Uncompressed logs in textfile tables: 106sec (filesize of 7,686 MB over 8
uncompressed files)
#2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8
compressed files)
#3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB over
126 compressed files)



> For the timing, how much mapper slots do you have in your cluster?


I have a 4-node cluster with mapred.reduce.tasks=17. Is that what you mean by
mapper slots?


> Approach #3:
> a) import gzip files into textfile table
> b) set hive.exec.compress.output to true
> c) inserted into sequencefile table
> This will create bigger sequencefiles which will help reducing the
> overhead. This is better than Approach #2 because jobs from the
> sequencefile tables will have more mappers.


This is exactly what I did in #3 above. But, from those benchmarks, #2 seems
to give the best results, both in terms of file size and speed. Is that not
what you were expecting?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
Hi Saurabh,

Can you help put that information into an appropriate place on the wiki
(where you see fit)?
Thanks for the help.

By the way, I guess we need to debug what went wrong with the
"count(1)" queries. There is definitely something going wrong.

For the timing, how many mapper slots do you have in your cluster?


I think you might want to consider this:

Approach #3:
a) import the gzip files into a textfile table
b) set hive.exec.compress.output to true
c) insert into a sequencefile table
This will create bigger sequencefiles, which will help reduce the
overhead. This is better than Approach #2 because jobs from the
sequencefile tables will have more mappers.
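
A rough HiveQL sketch of those three steps (table, column, and path names
are illustrative only; BLOCK compression is set via
mapred.output.compression.type, as discussed elsewhere in this thread):

    -- (a) staging table in textfile format; a gzipped text file loads as-is
    create table raw_text (line string) partitioned by (dt string)
      row format delimited fields terminated by '\t'
      stored as textfile;
    load data local inpath '/tmp/weblogs/access.log.gz'
      into table raw_text partition (dt='2009-06-01');

    -- (b) + (c) have Hive write block-compressed sequencefiles
    create table raw_seq (line string) partitioned by (dt string)
      stored as sequencefile;
    set hive.exec.compress.output=true;
    set mapred.output.compression.type=BLOCK;
    insert overwrite table raw_seq partition (dt='2009-06-01')
      select line from raw_text where dt='2009-06-01';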


Zheng

On Sat, Jul 25, 2009 at 3:48 AM, Saurabh Nanda<sa...@gmail.com> wrote:
>
>> TextFile means the plain text file (records delimited by "\n").
>> Compressed TextFiles are just text files compressed by gzip or bzip2
>> utility. SequenceFile is a special file format that only Hadoop can
>> understand.
>> Since your files are compressed TextFiles, you have to create a table
>> with TextFile format, in order to load the data without any
>> conversion.
>> (Compression is detected automatically for both TextFile and
>> SequenceFile - you don't need to specify it when creating a table)
>
> This really clears things up. I guess adding a note in the Wiki will put an
> end to the confusion permanently. A little note on the approach (compressed
> textfile vs compressed sequencefile) with the best performance would also be
> appreciated.
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> TextFile means the plain text file (records delimited by "\n").
> Compressed TextFiles are just text files compressed by gzip or bzip2
> utility. SequenceFile is a special file format that only Hadoop can
> understand.
> Since your files are compressed TextFiles, you have to create a table
> with TextFile format, in order to load the data without any
> conversion.
> (Compression is detected automatically for both TextFile and
> SequenceFile - you don't need to specify it when creating a table)



This really clears things up. I guess adding a note in the Wiki will put an
end to the confusion permanently. A little note on the approach (compressed
textfile vs compressed sequencefile) with the best performance would also be
appreciated.

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
Both TextFile and SequenceFile can be compressed or uncompressed.

TextFile means the plain text file (records delimited by "\n").
Compressed TextFiles are just text files compressed by gzip or bzip2
utility.
SequenceFile is a special file format that only Hadoop can understand.

Since your files are compressed TextFiles, you have to create a table
with TextFile format, in order to load the data without any
conversion.
(Compression is detected automatically for both TextFile and
SequenceFile - you don't need to specify it when creating a table)
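
For example, loading a gzipped log straight into a TextFile table would
look something like this (a minimal sketch; the table name and path are
made up):

    create table web_logs (line string)
      row format delimited fields terminated by '\t'
      stored as textfile;
    -- the .gz file is copied as-is and decompressed automatically at query time
    load data local inpath '/tmp/weblogs/access.log.gz' into table web_logs;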


Does this make things a bit clearer?

Zheng

On Sat, Jul 25, 2009 at 3:27 AM, Saurabh Nanda<sa...@gmail.com> wrote:
>
>> If you want to load data (in compressed/uncompressed text format) into
>> a table, you have to defined the table as "stored as textfile" instead
>> of "stored as sequencefile".
>
> I'm completely confused right now. If sequencefiles are not used for
> compressed data storage then what are they used for?
>
> If I have a gz file, and I want to import it as is (without gunzipping or
> using an intermediate table), what should I be doing?
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> If you want to load data (in compressed/uncompressed text format) into
>> a table, you have to defined the table as "stored as textfile" instead
>> of "stored as sequencefile".
>
>

I tried both the approaches.

Approach #1:
a) gunzip log file
b) import into textfile table
c) set hive.exec.compress.output to true
d) inserted into sequencefile table

It seems to have given me 125 files named 'attempt_*' in the partition's
directory, all under 10MB. (How do I find out the total size of a directory?
I need to see how much space the compression saved.)

Approach #2:   imported gzip log files into a textfile table

The files seem to have been copied as-is into the partition's directory. But
every query is always split up into 8 maps (which is the number of files I
imported). This, I guess, won't help me much because I would be
under-utilizing the map capacity I have.

Here's something interesting. I ran a SELECT COUNT(1) on all three
tables and got different results and wildly different response times.

Gunzipped files imported into textfile table: 8,259,720 (108 sec)
Sequencefile table populated by step 1d above: 8,316,946 (114 sec)
Gzip files imported into textfile tables: 8,619,980 (50 sec)

How is a simple row count differing? And surprisingly, fewer maps resulted
in better performance!

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
> If you want to load data (in compressed/uncompressed text format) into
> a table, you have to defined the table as "stored as textfile" instead
> of "stored as sequencefile".


I'm completely confused right now. If sequencefiles are not used for
compressed data storage then what are they used for?

If I have a gz file, and I want to import it as is (without gunzipping or
using an intermediate table), what should I be doing?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
Hi Saurabh,

If you want to load data (in compressed/uncompressed text format) into
a table, you have to define the table as "stored as textfile" instead
of "stored as sequencefile".

Can you try again and let us know?

Zheng

On Sat, Jul 25, 2009 at 3:05 AM, Saurabh Nanda<sa...@gmail.com> wrote:
> I tried the following and ran into an error message:
>
> create table compressed_raw(line string) partitioned by(dt string)
> row format delimited fields terminated by '\t' lines terminated by '\n'
> stored as sequencefile;
>
> hive> load data local inpath
> '/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz' into table
> compressed_raw partition(dt='2009-06-01');
> Copying data from file:/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz
> Loading data to table compressed_raw partition {dt=2009-06-01}
> Failed with exception Cannot load text files into a table stored as
> SequenceFile.
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MoveTask
>
> I guess this is what the following thread is talking about --
> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200903.mbox/%3C49C10E12.7010208@ecircle.com%3E
>
> To sum up the discussion there, do I have to first import into a textfile
> table, set hive.exec.compress.output to true, and then insert into a
> sequencefile table? If that's the case, I don't understand why I have to
> explicitly set hive.exec.compress.output? Shouldn't the fact that the target
> is a sequencefile table, achieve the desired result?
>
> I'm on hadoop-0.18.3 & hive-0.3.0
>
> PS: More details on the Wiki around compresses storage would be really
> appreciated.
>
> Saurabh.
>
> On Fri, Jul 24, 2009 at 10:02 PM, Neal Richter <nr...@gmail.com> wrote:
>>
>> gz files work fine.  We're attaching daily directories of gziped logs
>> in S3 as hive table partitions.
>>
>> Best to have your logrotator do hourly rotation to create lots of gz
>> files for better mapping.  OR one could use zcat, split, and gzip to
>> divide into smaller chunks if you really only have one gz file per
>> partition.
>>
>> On Fri, Jul 24, 2009 at 9:48 AM, <bc...@gmail.com> wrote:
>> > Have not checked gzip out yet but Hive is happy with .bz2 files. The
>> > documentation on this is spotty. It seems that any Hadoop supported
>> > compression will work. The issue with .gz files is that they will not be
>> > splittable. That is one map will process an entire file so if your .gz
>> > files
>> > are large and you have more map capability than files you will not be
>> > able
>> > to make use of it.
>> >
>> > On Jul 24, 2009 10:09am, Saurabh Nanda <sa...@gmail.com> wrote:
>> >> Please excuse my ignorance, but can I import gzip compressed files
>> >> directly as Hive tables? I have separate gzip files for each days
>> >> weblog
>> >> data. Right now I am gunzipping them and then importing into a raw
>> >> table.
>> >> Can I import the gzipped files directly into Hive?
>> >>
>> >>
>> >> Saurabh.
>> >>
>> >> On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo athusoo@facebook.com>
>> >> wrote:
>> >>
>> >> I don't think these are splittable. Compression on sequencefiles is
>> >> splittable across sequencefile blocks.
>> >>
>> >>
>> >>
>> >> Ashish
>> >>
>> >>
>> >>
>> >>
>> >> -----Original Message-----
>> >>
>> >> From: Bill Craig [mailto:bcraig7@gmail.com]
>> >>
>> >> Sent: Tuesday, July 21, 2009 8:06 AM
>> >>
>> >> To: hive-user@hadoop.apache.org
>> >>
>> >> Subject: bz2 Splits.
>> >>
>> >>
>> >>
>> >> I loaded 5 files of bzip2 compressed data into a table in Hive. Three
>> >> are
>> >> small test files containing 10,000 records. Two were large ~8Gb
>> >> compressed.
>> >>
>> >> When I run a query against the table I see three tasks that complete
>> >> almost immediately and two tasks that run for a very long time. It
>> >> appears
>> >> to me that Hive/Hadoop is not splitting the input of the *.bz2. I have
>> >> seen
>> >> some old mails about this, but could not find any resolution for this
>> >> problem. I compressed the files using the Apache bz2 jar, the file are
>> >> named
>> >> *.bz2. I am using Hadoop
>> >>
>> >>
>> >> 0.19.1 r745977
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> http://nandz.blogspot.com
>> >> http://foodieforlife.blogspot.com
>> >>
>
>
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
I tried the following and ran into an error message:

create table compressed_raw(line string) partitioned by(dt string)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as sequencefile;

hive> load data local inpath
'/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz' into table
compressed_raw partition(dt='2009-06-01');
Copying data from file:/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz
Loading data to table compressed_raw partition {dt=2009-06-01}
Failed with exception Cannot load text files into a table stored as
SequenceFile.
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask

I guess this is what the following thread is talking about --
http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200903.mbox/%3C49C10E12.7010208@ecircle.com%3E

To sum up the discussion there, do I have to first import into a textfile
table, set hive.exec.compress.output to true, and then insert into a
sequencefile table? If that's the case, I don't understand why I have to
explicitly set hive.exec.compress.output. Shouldn't the fact that the target
is a sequencefile table achieve the desired result?

I'm on hadoop-0.18.3 & hive-0.3.0

PS: More details on the Wiki around compressed storage would be really
appreciated.

Saurabh.

On Fri, Jul 24, 2009 at 10:02 PM, Neal Richter <nr...@gmail.com> wrote:

> gz files work fine.  We're attaching daily directories of gziped logs
> in S3 as hive table partitions.
>
> Best to have your logrotator do hourly rotation to create lots of gz
> files for better mapping.  OR one could use zcat, split, and gzip to
> divide into smaller chunks if you really only have one gz file per
> partition.
>
> On Fri, Jul 24, 2009 at 9:48 AM, <bc...@gmail.com> wrote:
> > Have not checked gzip out yet but Hive is happy with .bz2 files. The
> > documentation on this is spotty. It seems that any Hadoop supported
> > compression will work. The issue with .gz files is that they will not be
> > splittable. That is one map will process an entire file so if your .gz
> files
> > are large and you have more map capability than files you will not be
> able
> > to make use of it.
> >
> > On Jul 24, 2009 10:09am, Saurabh Nanda <sa...@gmail.com> wrote:
> >> Please excuse my ignorance, but can I import gzip compressed files
> >> directly as Hive tables? I have separate gzip files for each days weblog
> >> data. Right now I am gunzipping them and then importing into a raw
> table.
> >> Can I import the gzipped files directly into Hive?
> >>
> >>
> >> Saurabh.
> >>
> >> On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo athusoo@facebook.com>
> >> wrote:
> >>
> >> I don't think these are splittable. Compression on sequencefiles is
> >> splittable across sequencefile blocks.
> >>
> >>
> >>
> >> Ashish
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >>
> >> From: Bill Craig [mailto:bcraig7@gmail.com]
> >>
> >> Sent: Tuesday, July 21, 2009 8:06 AM
> >>
> >> To: hive-user@hadoop.apache.org
> >>
> >> Subject: bz2 Splits.
> >>
> >>
> >>
> >> I loaded 5 files of bzip2 compressed data into a table in Hive. Three
> are
> >> small test files containing 10,000 records. Two were large ~8Gb
> compressed.
> >>
> >> When I run a query against the table I see three tasks that complete
> >> almost immediately and two tasks that run for a very long time. It
> appears
> >> to me that Hive/Hadoop is not splitting the input of the *.bz2. I have
> seen
> >> some old mails about this, but could not find any resolution for this
> >> problem. I compressed the files using the Apache bz2 jar, the file are
> named
> >> *.bz2. I am using Hadoop
> >>
> >>
> >> 0.19.1 r745977
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> http://nandz.blogspot.com
> >> http://foodieforlife.blogspot.com
> >>
>



-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: Re: bz2 Splits.

Posted by Neal Richter <nr...@gmail.com>.
gz files work fine. We're attaching daily directories of gzipped logs
in S3 as hive table partitions.

Best to have your log rotator do hourly rotation to create lots of gz
files for better mapping. Or one could use zcat, split, and gzip to
divide them into smaller chunks if you really only have one gz file per
partition.

On Fri, Jul 24, 2009 at 9:48 AM, <bc...@gmail.com> wrote:
> Have not checked gzip out yet but Hive is happy with .bz2 files. The
> documentation on this is spotty. It seems that any Hadoop supported
> compression will work. The issue with .gz files is that they will not be
> splittable. That is one map will process an entire file so if your .gz files
> are large and you have more map capability than files you will not be able
> to make use of it.
>
> On Jul 24, 2009 10:09am, Saurabh Nanda <sa...@gmail.com> wrote:
>> Please excuse my ignorance, but can I import gzip compressed files
>> directly as Hive tables? I have separate gzip files for each days weblog
>> data. Right now I am gunzipping them and then importing into a raw table.
>> Can I import the gzipped files directly into Hive?
>>
>>
>> Saurabh.
>>
>> On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo athusoo@facebook.com>
>> wrote:
>>
>> I don't think these are splittable. Compression on sequencefiles is
>> splittable across sequencefile blocks.
>>
>>
>>
>> Ashish
>>
>>
>>
>>
>> -----Original Message-----
>>
>> From: Bill Craig [mailto:bcraig7@gmail.com]
>>
>> Sent: Tuesday, July 21, 2009 8:06 AM
>>
>> To: hive-user@hadoop.apache.org
>>
>> Subject: bz2 Splits.
>>
>>
>>
>> I loaded 5 files of bzip2 compressed data into a table in Hive. Three are
>> small test files containing 10,000 records. Two were large ~8Gb compressed.
>>
>> When I run a query against the table I see three tasks that complete
>> almost immediately and two tasks that run for a very long time. It appears
>> to me that Hive/Hadoop is not splitting the input of the *.bz2. I have seen
>> some old mails about this, but could not find any resolution for this
>> problem. I compressed the files using the Apache bz2 jar, the file are named
>> *.bz2. I am using Hadoop
>>
>>
>> 0.19.1 r745977
>>
>>
>>
>>
>>
>>
>> --
>> http://nandz.blogspot.com
>> http://foodieforlife.blogspot.com
>>

Re: Re: bz2 Splits.

Posted by bc...@gmail.com.
Have not checked gzip out yet, but Hive is happy with .bz2 files. The
documentation on this is spotty. It seems that any Hadoop-supported
compression will work. The issue with .gz files is that they will not be
splittable. That is, one map will process an entire file, so if your .gz
files are large and you have more map capacity than files, you will not be
able to make use of it.

On Jul 24, 2009 10:09am, Saurabh Nanda <sa...@gmail.com> wrote:
> Please excuse my ignorance, but can I import gzip compressed files  
> directly as Hive tables? I have separate gzip files for each days weblog  
> data. Right now I am gunzipping them and then importing into a raw table.  
> Can I import the gzipped files directly into Hive?


> Saurabh.

> On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo athusoo@facebook.com>  
> wrote:

> I don't think these are splittable. Compression on sequencefiles is  
> splittable across sequencefile blocks.



> Ashish




> -----Original Message-----

> From: Bill Craig [mailto:bcraig7@gmail.com]

> Sent: Tuesday, July 21, 2009 8:06 AM

> To: hive-user@hadoop.apache.org

> Subject: bz2 Splits.



> I loaded 5 files of bzip2 compressed data into a table in Hive. Three are  
> small test files containing 10,000 records. Two were large ~8Gb  
> compressed.

> When I run a query against the table I see three tasks that complete  
> almost immediately and two tasks that run for a very long time. It  
> appears to me that Hive/Hadoop is not splitting the input of the *.bz2. I  
> have seen some old mails about this, but could not find any resolution  
> for this problem. I compressed the files using the Apache bz2 jar, the  
> file are named *.bz2. I am using Hadoop


> 0.19.1 r745977






> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com


Re: bz2 Splits.

Posted by Saurabh Nanda <sa...@gmail.com>.
Please excuse my ignorance, but can I import gzip-compressed files directly
as Hive tables? I have separate gzip files for each day's weblog data. Right
now I am gunzipping them and then importing into a raw table. Can I import
the gzipped files directly into Hive?

Saurabh.

On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo <at...@facebook.com> wrote:

> I don't think these are splittable. Compression on sequencefiles is
> splittable across sequencefile blocks.
>
> Ashish
>
> -----Original Message-----
> From: Bill Craig [mailto:bcraig7@gmail.com]
> Sent: Tuesday, July 21, 2009 8:06 AM
> To: hive-user@hadoop.apache.org
> Subject: bz2 Splits.
>
> I loaded 5 files of bzip2 compressed data into a table in Hive. Three are
> small test files containing 10,000 records. Two were large ~8Gb compressed.
> When I run a query against the table I see three tasks that complete almost
> immediately and two tasks that run for a very long time. It appears to me
> that Hive/Hadoop is not splitting the input of the *.bz2. I have seen some
> old mails about this, but could not find any resolution for this problem. I
> compressed the files using the Apache bz2 jar, the file are named *.bz2. I
> am using Hadoop
> 0.19.1 r745977
>



-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: bz2 Splits.

Posted by Zheng Shao <zs...@gmail.com>.
There is some work along this direction in Hadoop, but it's
not committed yet:
https://issues.apache.org/jira/browse/HADOOP-4012

For the short term, we won't be able to split bzip files.

If your bzip files are generated outside of hadoop, please split the
files before doing compression (so you will load many smaller files to
hadoop/hive).
If your bzip files are generated by hadoop/hive, please change the
output file format to SequenceFile; SequenceFiles are splittable.
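
For the second case, a minimal sketch of what that looks like (table names
are illustrative):

    create table logs_seq (line string) stored as sequencefile;
    -- have Hive compress the data it writes into the sequencefile table
    set hive.exec.compress.output=true;
    insert overwrite table logs_seq select line from logs_text;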

Zheng

On Tue, Jul 21, 2009 at 12:37 PM, Ashish Thusoo<at...@facebook.com> wrote:
> I don't think these are splittable. Compression on sequencefiles is splittable across sequencefile blocks.
>
> Ashish
>
> -----Original Message-----
> From: Bill Craig [mailto:bcraig7@gmail.com]
> Sent: Tuesday, July 21, 2009 8:06 AM
> To: hive-user@hadoop.apache.org
> Subject: bz2 Splits.
>
> I loaded 5 files of bzip2 compressed data into a table in Hive. Three are small test files containing 10,000 records. Two were large ~8Gb compressed.
> When I run a query against the table I see three tasks that complete almost immediately and two tasks that run for a very long time. It appears to me that Hive/Hadoop is not splitting the input of the *.bz2. I have seen some old mails about this, but could not find any resolution for this problem. I compressed the files using the Apache bz2 jar, the file are named *.bz2. I am using Hadoop
> 0.19.1 r745977
>



-- 
Yours,
Zheng

RE: bz2 Splits.

Posted by Ashish Thusoo <at...@facebook.com>.
I don't think these are splittable. Compression on sequencefiles is splittable across sequencefile blocks.

Ashish 

-----Original Message-----
From: Bill Craig [mailto:bcraig7@gmail.com] 
Sent: Tuesday, July 21, 2009 8:06 AM
To: hive-user@hadoop.apache.org
Subject: bz2 Splits.

I loaded 5 files of bzip2 compressed data into a table in Hive. Three are small test files containing 10,000 records. Two were large ~8Gb compressed.
When I run a query against the table I see three tasks that complete almost immediately and two tasks that run for a very long time. It appears to me that Hive/Hadoop is not splitting the input of the *.bz2. I have seen some old mails about this, but could not find any resolution for this problem. I compressed the files using the Apache bz2 jar, the file are named *.bz2. I am using Hadoop
0.19.1 r745977