Posted to user@hbase.apache.org by Steinmaurer Thomas <Th...@scch.at> on 2011/07/28 15:31:20 UTC

GZ better than LZO?

Hello,

 

we ran a test client generating data into a GZ-compressed and an
LZO-compressed table. Equal data sets (1,008,000 rows and the same table
schema), ~7.78 GB of uncompressed disk space in HDFS. LZO comes to ~887 MB
whereas GZ is ~444 MB, so basically half of LZO.

 

Execution time of the data-generating client was 1373 seconds into the
uncompressed table, 3374 seconds into LZO and 2198 seconds into GZ. The
data generation client is based on HTablePool and uses batch operations.
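
For context, here is a minimal sketch of the kind of batched write the
client does, using the HBase Java client API; the table and column names
are simplified placeholders rather than our actual schema:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTablePool pool = new HTablePool(conf, 10);
            HTableInterface table = pool.getTable("testdata_gz"); // placeholder table name

            List<Put> batch = new ArrayList<Put>();
            for (int i = 0; i < 1000; i++) {
                Put put = new Put(Bytes.toBytes(String.format("%016d", i)));
                put.add(Bytes.toBytes("data_details"),              // column family
                        Bytes.toBytes("temperature1_value"),        // qualifier
                        Bytes.toBytes(String.valueOf(i % 65536)));  // random value in the real client
                batch.add(put);
            }
            table.put(batch);     // one batched round trip instead of one RPC per row
            table.flushCommits(); // make sure any buffered puts are sent
            pool.putTable(table); // return the table instance to the pool
        }
    }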

 

So in our (simple) test, GZ beats LZO in both disk usage and execution
time of the client. We haven't tried reads yet.

 

Is this an expected result? I thought LZO was the recommended compression
algorithm. Or does LZO outperform GZ with a growing amount of data, or in
read scenarios?

 

Regards,

Thomas

 


RE: GZ better than LZO?

Posted by Steinmaurer Thomas <Th...@scch.at>.
Ah, sorry. 550 million rows, not billions.

Thomas

-----Original Message-----
From: Steinmaurer Thomas [mailto:Thomas.Steinmaurer@scch.at] 
Sent: Donnerstag, 18. August 2011 14:55
To: user@hbase.apache.org
Subject: RE: GZ better than LZO?

After our tests with ~550 bill. rows, we probably will go with Snappy. Our test showed better write performance compared to GZ and LZO, with only slightly more disk usage compared to LZO. 

Haven't looked at comparing read performance for our pattern, but performance of Snappy should be sufficient here as well.

Regards,
Thomas

-----Original Message-----
From: BlueDavy Lin [mailto:bluedavy@gmail.com]
Sent: Donnerstag, 18. August 2011 04:06
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

We test gz also,but when we use gz,it seems will cause memory out of usage.

It seems maybe because gz not use Deflater/Inflater correctly (not call end method explicit)

2011/8/18 Sandy Pratt <pr...@adobe.com>:
> I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks, but I did note that the overall time of some MR jobs on our small cluster (~2B records at the time IIRC) went down slightly after the change.
>
> The primary reason I switched was not due to performance, however, but due to compression ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and released along with Hadoop, whereas LZO wasn't when I last used it (not an academic concern, it turned out).
>
> One speculation about where the discrepancy between micro-benchmarks and actual use may arise: do benchmarks include the cost of marshaling the data (64MB before compression region say) from disk?  If the benchmark starts with the data in memory (and how do you know if it does or not, given the layers of cache between you and the platters) then it might not reflect real world HBase scenarios.  GZ may need to read only 20MB while LZO might need to read 32MB.  Does that difference dominate the computational cost of decompression?
>
>
> Sandy
>
>
>> -----Original Message-----
>> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
>> Sent: Friday, July 29, 2011 08:44
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> For what's it worth I had similar observations.
>>
>> I simulated heavy write load and I found that NO compression was the 
>> fastest, followed by GZ, followed by LZO.
>> After the tests I did a major_compact of the tables, and I included 
>> that time in the total.
>> Also these tests where done with a single region server, in order to 
>> isolate compression performance better.
>>
>>
>> So at least you're not the only one seeing this :) However, it seems 
>> that this heavily depends on the details of your setup (relative CPU 
>> vs IO performance, for example).
>>
>>
>> ----- Original Message -----
>> From: Steinmaurer Thomas <Th...@scch.at>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Thursday, July 28, 2011 11:27 PM
>> Subject: RE: GZ better than LZO?
>>
>> Hello,
>>
>> we simulated real looking data (as in our expected production system) 
>> in respect to row-key, column families ...
>>
>> The test client (TDG) basically implement a three-part row key.
>>
>> vehicle-device-reversedtimestamp
>>
>> vehicle: 16 characters, left-padded with "0"
>> device: 16 characters, left-padded with "0"
>> reversedtimestamp: YYYYMMDDhhmmss
>>
>> There are four column families, although currently only one called 
>> "data_details" is filled by the TDG. The others are reserved for later use.
>> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
>>
>> The qualifiers for "data_details" are basically based on an enum with
>> 25 members. And each member has three occurrences, defined by adding 
>> a different suffix to the qualifier name.
>>
>> Let's say, there is an enum member called "temperature1", then there 
>> are the following qualifiers used:
>>
>> temperature1_value
>> temperature1_unit
>> temperature1_validity
>>
>> So, we end up with 25 * 3 = 75 qualifiers per row, filled with random 
>> values in a range from [0, 65535] each.
>>
>> TDG basically allows to define the number of simulated clients (one 
>> thread per client), enabled to run them in multi-threaded mode or in
>> single- threaded mode. Data volume is defined by number of iterations 
>> of the set of simulated clients, the number of iterations per client, 
>> number of devices per client and number of rows per device.
>>
>> After the test has finished, 1.008.000 rows were inserted and 
>> successfully replicated to our backup test cluster.
>>
>> Any further ideas?
>>
>> PS: We are currently running a test with ~ 4mio rows following the 
>> pattern above.
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> -----Original Message-----
>> From: Chiku [mailto:hakisenin@gmail.com]
>> Sent: Donnerstag, 28. Juli 2011 15:35
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> Are you getting this results because of the nature of test data generated?
>>
>> Would you mind sharing some details about the test client and the 
>> data it generates?
>>
>>
>> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas < 
>> Thomas.Steinmaurer@scch.at> wrote:
>>
>> > Hello,
>> >
>> >
>> >
>> > we ran a test client generating data into GZ and LZO compressed table.
>> > Equal data sets (number of rows: 1008000 and the same table 
>> > schema). ~
>> > 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ 
>> > is
>>
>> > ~
>> > 444 MB, so basically half of LZO.
>> >
>> >
>> >
>> > Execution time of the data generating client was 1373 seconds into 
>> > the
>>
>> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The 
>> > data
>>
>> > generation client is based on HTablePool and using batch operations.
>> >
>> >
>> >
>> > So in our (simple) test, GZ beats LZO in both, disk usage and 
>> > execution time of the client. We haven't tried reads yet.
>> >
>> >
>> >
>> > Is this an expected result? I thought LZO is the recommended 
>> > compression algorithm? Or does LZO outperforms GZ with a growing 
>> > amount of data or in read scenarios?
>> >
>> >
>> >
>> > Regards,
>> >
>> > Thomas
>> >
>> >
>> >
>> >
>
>



--
=============================
|     BlueDavy                                      |
|     http://www.bluedavy.com                |
=============================

RE: GZ better than LZO?

Posted by Steinmaurer Thomas <Th...@scch.at>.
After our tests with ~550 bill. rows, we will probably go with Snappy. Our test showed better write performance compared to GZ and LZO, with only slightly more disk usage than LZO.

We haven't compared read performance for our access pattern yet, but the performance of Snappy should be sufficient there as well.
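
If useful, here is a minimal sketch of how enabling the codec on a column family can look via the Java admin API. The table name is a placeholder, and this assumes an HBase build that includes Snappy support with the codec installed on every region server:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateSnappyTable {
        public static void main(String[] args) throws Exception {
            HTableDescriptor desc = new HTableDescriptor("testdata_snappy"); // placeholder name
            HColumnDescriptor family = new HColumnDescriptor("data_details");
            family.setCompressionType(Compression.Algorithm.SNAPPY); // GZ/LZO are set the same way
            family.setScope(1); // REPLICATION_SCOPE = 1, matching the schema from the earlier mails
            desc.addFamily(family);
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            admin.createTable(desc);
        }
    }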

Regards,
Thomas

-----Original Message-----
From: BlueDavy Lin [mailto:bluedavy@gmail.com] 
Sent: Donnerstag, 18. August 2011 04:06
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

We test gz also,but when we use gz,it seems will cause memory out of usage.

It seems maybe because gz not use Deflater/Inflater correctly (not call end method explicit)

2011/8/18 Sandy Pratt <pr...@adobe.com>:
> I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks, but I did note that the overall time of some MR jobs on our small cluster (~2B records at the time IIRC) went down slightly after the change.
>
> The primary reason I switched was not due to performance, however, but due to compression ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and released along with Hadoop, whereas LZO wasn't when I last used it (not an academic concern, it turned out).
>
> One speculation about where the discrepancy between micro-benchmarks and actual use may arise: do benchmarks include the cost of marshaling the data (64MB before compression region say) from disk?  If the benchmark starts with the data in memory (and how do you know if it does or not, given the layers of cache between you and the platters) then it might not reflect real world HBase scenarios.  GZ may need to read only 20MB while LZO might need to read 32MB.  Does that difference dominate the computational cost of decompression?
>
>
> Sandy
>
>
>> -----Original Message-----
>> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
>> Sent: Friday, July 29, 2011 08:44
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> For what's it worth I had similar observations.
>>
>> I simulated heavy write load and I found that NO compression was the 
>> fastest, followed by GZ, followed by LZO.
>> After the tests I did a major_compact of the tables, and I included 
>> that time in the total.
>> Also these tests where done with a single region server, in order to 
>> isolate compression performance better.
>>
>>
>> So at least you're not the only one seeing this :) However, it seems 
>> that this heavily depends on the details of your setup (relative CPU 
>> vs IO performance, for example).
>>
>>
>> ----- Original Message -----
>> From: Steinmaurer Thomas <Th...@scch.at>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Thursday, July 28, 2011 11:27 PM
>> Subject: RE: GZ better than LZO?
>>
>> Hello,
>>
>> we simulated real looking data (as in our expected production system) 
>> in respect to row-key, column families ...
>>
>> The test client (TDG) basically implement a three-part row key.
>>
>> vehicle-device-reversedtimestamp
>>
>> vehicle: 16 characters, left-padded with "0"
>> device: 16 characters, left-padded with "0"
>> reversedtimestamp: YYYYMMDDhhmmss
>>
>> There are four column families, although currently only one called 
>> "data_details" is filled by the TDG. The others are reserved for later use.
>> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
>>
>> The qualifiers for "data_details" are basically based on an enum with 
>> 25 members. And each member has three occurrences, defined by adding 
>> a different suffix to the qualifier name.
>>
>> Let's say, there is an enum member called "temperature1", then there 
>> are the following qualifiers used:
>>
>> temperature1_value
>> temperature1_unit
>> temperature1_validity
>>
>> So, we end up with 25 * 3 = 75 qualifiers per row, filled with random 
>> values in a range from [0, 65535] each.
>>
>> TDG basically allows to define the number of simulated clients (one 
>> thread per client), enabled to run them in multi-threaded mode or in 
>> single- threaded mode. Data volume is defined by number of iterations 
>> of the set of simulated clients, the number of iterations per client, 
>> number of devices per client and number of rows per device.
>>
>> After the test has finished, 1.008.000 rows were inserted and 
>> successfully replicated to our backup test cluster.
>>
>> Any further ideas?
>>
>> PS: We are currently running a test with ~ 4mio rows following the 
>> pattern above.
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> -----Original Message-----
>> From: Chiku [mailto:hakisenin@gmail.com]
>> Sent: Donnerstag, 28. Juli 2011 15:35
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> Are you getting this results because of the nature of test data generated?
>>
>> Would you mind sharing some details about the test client and the 
>> data it generates?
>>
>>
>> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas < 
>> Thomas.Steinmaurer@scch.at> wrote:
>>
>> > Hello,
>> >
>> >
>> >
>> > we ran a test client generating data into GZ and LZO compressed table.
>> > Equal data sets (number of rows: 1008000 and the same table 
>> > schema). ~
>> > 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ 
>> > is
>>
>> > ~
>> > 444 MB, so basically half of LZO.
>> >
>> >
>> >
>> > Execution time of the data generating client was 1373 seconds into 
>> > the
>>
>> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The 
>> > data
>>
>> > generation client is based on HTablePool and using batch operations.
>> >
>> >
>> >
>> > So in our (simple) test, GZ beats LZO in both, disk usage and 
>> > execution time of the client. We haven't tried reads yet.
>> >
>> >
>> >
>> > Is this an expected result? I thought LZO is the recommended 
>> > compression algorithm? Or does LZO outperforms GZ with a growing 
>> > amount of data or in read scenarios?
>> >
>> >
>> >
>> > Regards,
>> >
>> > Thomas
>> >
>> >
>> >
>> >
>
>



--
=============================
|     BlueDavy                                      |
|     http://www.bluedavy.com                |
=============================

RE: GZ better than LZO?

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
If it helps, I tried the LZO setup on CDH3 on my Ubuntu machine.
I have documented the steps here; they should work for others too.

http://www.srikanthps.com/2011/08/configuring-lzo-compression-for-cdh3.html

Regards,
Srikanth


________________________________________
From: Sandy Pratt [prattrs@adobe.com]
Sent: Friday, August 19, 2011 12:21 AM
To: user@hbase.apache.org
Subject: RE: GZ better than LZO?

You're definitely going to want to use the native libraries for zlib and gzip.

http://hadoop.apache.org/common/docs/current/native_libraries.html

It's actually a fairly easy build, and it comes out of the box with CDH IIRC.  You can put a symlink to hadoop/lib/native in hbase/lib and you're done.

When HBase falls back to Java for GZ and zlib, it will definitely be a bad thing =/

Sandy


> -----Original Message-----
> From: BlueDavy Lin [mailto:bluedavy@gmail.com]
> Sent: Wednesday, August 17, 2011 19:07
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
>
> We test gz also,but when we use gz,it seems will cause memory out of
> usage.
>
> It seems maybe because gz not use Deflater/Inflater correctly (not call end
> method explicit)
>
> 2011/8/18 Sandy Pratt <pr...@adobe.com>:
> > I also switched from LZO to GZ a while back.  I didn't do any micro-
> benchmarks, but I did note that the overall time of some MR jobs on our
> small cluster (~2B records at the time IIRC) went down slightly after the
> change.
> >
> > The primary reason I switched was not due to performance, however, but
> due to compression ratio and licensing/build issues.  AFAIK, the GZ code is
> branched, tested and released along with Hadoop, whereas LZO wasn't
> when I last used it (not an academic concern, it turned out).
> >
> > One speculation about where the discrepancy between micro-benchmarks
> and actual use may arise: do benchmarks include the cost of marshaling the
> data (64MB before compression region say) from disk?  If the benchmark
> starts with the data in memory (and how do you know if it does or not, given
> the layers of cache between you and the platters) then it might not reflect
> real world HBase scenarios.  GZ may need to read only 20MB while LZO might
> need to read 32MB.  Does that difference dominate the computational cost
> of decompression?
> >
> >
> > Sandy
> >
> >
> >> -----Original Message-----
> >> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> >> Sent: Friday, July 29, 2011 08:44
> >> To: user@hbase.apache.org
> >> Subject: Re: GZ better than LZO?
> >>
> >> For what's it worth I had similar observations.
> >>
> >> I simulated heavy write load and I found that NO compression was the
> >> fastest, followed by GZ, followed by LZO.
> >> After the tests I did a major_compact of the tables, and I included
> >> that time in the total.
> >> Also these tests where done with a single region server, in order to
> >> isolate compression performance better.
> >>
> >>
> >> So at least you're not the only one seeing this :) However, it seems
> >> that this heavily depends on the details of your setup (relative CPU
> >> vs IO performance, for example).
> >>
> >>
> >> ----- Original Message -----
> >> From: Steinmaurer Thomas <Th...@scch.at>
> >> To: user@hbase.apache.org
> >> Cc:
> >> Sent: Thursday, July 28, 2011 11:27 PM
> >> Subject: RE: GZ better than LZO?
> >>
> >> Hello,
> >>
> >> we simulated real looking data (as in our expected production system)
> >> in respect to row-key, column families ...
> >>
> >> The test client (TDG) basically implement a three-part row key.
> >>
> >> vehicle-device-reversedtimestamp
> >>
> >> vehicle: 16 characters, left-padded with "0"
> >> device: 16 characters, left-padded with "0"
> >> reversedtimestamp: YYYYMMDDhhmmss
> >>
> >> There are four column families, although currently only one called
> >> "data_details" is filled by the TDG. The others are reserved for later use.
> >> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
> >>
> >> The qualifiers for "data_details" are basically based on an enum with
> >> 25 members. And each member has three occurrences, defined by adding
> >> a different suffix to the qualifier name.
> >>
> >> Let's say, there is an enum member called "temperature1", then there
> >> are the following qualifiers used:
> >>
> >> temperature1_value
> >> temperature1_unit
> >> temperature1_validity
> >>
> >> So, we end up with 25 * 3 = 75 qualifiers per row, filled with random
> >> values in a range from [0, 65535] each.
> >>
> >> TDG basically allows to define the number of simulated clients (one
> >> thread per client), enabled to run them in multi-threaded mode or in
> >> single- threaded mode. Data volume is defined by number of iterations
> >> of the set of simulated clients, the number of iterations per client,
> >> number of devices per client and number of rows per device.
> >>
> >> After the test has finished, 1.008.000 rows were inserted and
> >> successfully replicated to our backup test cluster.
> >>
> >> Any further ideas?
> >>
> >> PS: We are currently running a test with ~ 4mio rows following the
> >> pattern above.
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Chiku [mailto:hakisenin@gmail.com]
> >> Sent: Donnerstag, 28. Juli 2011 15:35
> >> To: user@hbase.apache.org
> >> Subject: Re: GZ better than LZO?
> >>
> >> Are you getting this results because of the nature of test data generated?
> >>
> >> Would you mind sharing some details about the test client and the
> >> data it generates?
> >>
> >>
> >> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
> >> Thomas.Steinmaurer@scch.at> wrote:
> >>
> >> > Hello,
> >> >
> >> >
> >> >
> >> > we ran a test client generating data into GZ and LZO compressed table.
> >> > Equal data sets (number of rows: 1008000 and the same table
> >> > schema). ~
> >> > 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ
> >> > is
> >>
> >> > ~
> >> > 444 MB, so basically half of LZO.
> >> >
> >> >
> >> >
> >> > Execution time of the data generating client was 1373 seconds into
> >> > the
> >>
> >> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The
> >> > data
> >>
> >> > generation client is based on HTablePool and using batch operations.
> >> >
> >> >
> >> >
> >> > So in our (simple) test, GZ beats LZO in both, disk usage and
> >> > execution time of the client. We haven't tried reads yet.
> >> >
> >> >
> >> >
> >> > Is this an expected result? I thought LZO is the recommended
> >> > compression algorithm? Or does LZO outperforms GZ with a growing
> >> > amount of data or in read scenarios?
> >> >
> >> >
> >> >
> >> > Regards,
> >> >
> >> > Thomas
> >> >
> >> >
> >> >
> >> >
> >
> >
>
>
>
> --
> =============================
> |     BlueDavy                                      |
> |     http://www.bluedavy.com                |
> =============================


RE: GZ better than LZO?

Posted by Sandy Pratt <pr...@adobe.com>.
You're definitely going to want to use the native libraries for zlib and gzip.

http://hadoop.apache.org/common/docs/current/native_libraries.html

It's actually a fairly easy build, and it comes out of the box with CDH IIRC.  You can put a symlink to hadoop/lib/native in hbase/lib and you're done.

When HBase falls back to the pure-Java implementation for GZ and zlib, it will definitely hurt performance =/

Sandy


> -----Original Message-----
> From: BlueDavy Lin [mailto:bluedavy@gmail.com]
> Sent: Wednesday, August 17, 2011 19:07
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> We test gz also,but when we use gz,it seems will cause memory out of
> usage.
> 
> It seems maybe because gz not use Deflater/Inflater correctly (not call end
> method explicit)
> 
> 2011/8/18 Sandy Pratt <pr...@adobe.com>:
> > I also switched from LZO to GZ a while back.  I didn't do any micro-
> benchmarks, but I did note that the overall time of some MR jobs on our
> small cluster (~2B records at the time IIRC) went down slightly after the
> change.
> >
> > The primary reason I switched was not due to performance, however, but
> due to compression ratio and licensing/build issues.  AFAIK, the GZ code is
> branched, tested and released along with Hadoop, whereas LZO wasn't
> when I last used it (not an academic concern, it turned out).
> >
> > One speculation about where the discrepancy between micro-benchmarks
> and actual use may arise: do benchmarks include the cost of marshaling the
> data (64MB before compression region say) from disk?  If the benchmark
> starts with the data in memory (and how do you know if it does or not, given
> the layers of cache between you and the platters) then it might not reflect
> real world HBase scenarios.  GZ may need to read only 20MB while LZO might
> need to read 32MB.  Does that difference dominate the computational cost
> of decompression?
> >
> >
> > Sandy
> >
> >
> >> -----Original Message-----
> >> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> >> Sent: Friday, July 29, 2011 08:44
> >> To: user@hbase.apache.org
> >> Subject: Re: GZ better than LZO?
> >>
> >> For what's it worth I had similar observations.
> >>
> >> I simulated heavy write load and I found that NO compression was the
> >> fastest, followed by GZ, followed by LZO.
> >> After the tests I did a major_compact of the tables, and I included
> >> that time in the total.
> >> Also these tests where done with a single region server, in order to
> >> isolate compression performance better.
> >>
> >>
> >> So at least you're not the only one seeing this :) However, it seems
> >> that this heavily depends on the details of your setup (relative CPU
> >> vs IO performance, for example).
> >>
> >>
> >> ----- Original Message -----
> >> From: Steinmaurer Thomas <Th...@scch.at>
> >> To: user@hbase.apache.org
> >> Cc:
> >> Sent: Thursday, July 28, 2011 11:27 PM
> >> Subject: RE: GZ better than LZO?
> >>
> >> Hello,
> >>
> >> we simulated real looking data (as in our expected production system)
> >> in respect to row-key, column families ...
> >>
> >> The test client (TDG) basically implement a three-part row key.
> >>
> >> vehicle-device-reversedtimestamp
> >>
> >> vehicle: 16 characters, left-padded with "0"
> >> device: 16 characters, left-padded with "0"
> >> reversedtimestamp: YYYYMMDDhhmmss
> >>
> >> There are four column families, although currently only one called
> >> "data_details" is filled by the TDG. The others are reserved for later use.
> >> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
> >>
> >> The qualifiers for "data_details" are basically based on an enum with
> >> 25 members. And each member has three occurrences, defined by adding
> >> a different suffix to the qualifier name.
> >>
> >> Let's say, there is an enum member called "temperature1", then there
> >> are the following qualifiers used:
> >>
> >> temperature1_value
> >> temperature1_unit
> >> temperature1_validity
> >>
> >> So, we end up with 25 * 3 = 75 qualifiers per row, filled with random
> >> values in a range from [0, 65535] each.
> >>
> >> TDG basically allows to define the number of simulated clients (one
> >> thread per client), enabled to run them in multi-threaded mode or in
> >> single- threaded mode. Data volume is defined by number of iterations
> >> of the set of simulated clients, the number of iterations per client,
> >> number of devices per client and number of rows per device.
> >>
> >> After the test has finished, 1.008.000 rows were inserted and
> >> successfully replicated to our backup test cluster.
> >>
> >> Any further ideas?
> >>
> >> PS: We are currently running a test with ~ 4mio rows following the
> >> pattern above.
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Chiku [mailto:hakisenin@gmail.com]
> >> Sent: Donnerstag, 28. Juli 2011 15:35
> >> To: user@hbase.apache.org
> >> Subject: Re: GZ better than LZO?
> >>
> >> Are you getting this results because of the nature of test data generated?
> >>
> >> Would you mind sharing some details about the test client and the
> >> data it generates?
> >>
> >>
> >> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
> >> Thomas.Steinmaurer@scch.at> wrote:
> >>
> >> > Hello,
> >> >
> >> >
> >> >
> >> > we ran a test client generating data into GZ and LZO compressed table.
> >> > Equal data sets (number of rows: 1008000 and the same table
> >> > schema). ~
> >> > 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ
> >> > is
> >>
> >> > ~
> >> > 444 MB, so basically half of LZO.
> >> >
> >> >
> >> >
> >> > Execution time of the data generating client was 1373 seconds into
> >> > the
> >>
> >> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The
> >> > data
> >>
> >> > generation client is based on HTablePool and using batch operations.
> >> >
> >> >
> >> >
> >> > So in our (simple) test, GZ beats LZO in both, disk usage and
> >> > execution time of the client. We haven't tried reads yet.
> >> >
> >> >
> >> >
> >> > Is this an expected result? I thought LZO is the recommended
> >> > compression algorithm? Or does LZO outperforms GZ with a growing
> >> > amount of data or in read scenarios?
> >> >
> >> >
> >> >
> >> > Regards,
> >> >
> >> > Thomas
> >> >
> >> >
> >> >
> >> >
> >
> >
> 
> 
> 
> --
> =============================
> |     BlueDavy                                      |
> |     http://www.bluedavy.com                |
> =============================

Re: GZ better than LZO?

Posted by BlueDavy Lin <bl...@gmail.com>.
We tested gz as well, but when we use gz, it seems to cause out-of-memory problems.

It seems this may be because the gz codec does not use Deflater/Inflater
correctly (the end() method is not called explicitly).
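
To illustrate the suspicion: java.util.zip.Deflater/Inflater hold native zlib
memory that is released promptly only when end() is called. A sketch of the
safe usage pattern (not the actual HBase codec code):

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    public class DeflaterEndSketch {
        public static byte[] compress(byte[] input) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            try {
                deflater.setInput(input);
                deflater.finish();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                while (!deflater.finished()) {
                    int n = deflater.deflate(buf);
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } finally {
                // Without an explicit end(), the native zlib memory behind the Deflater
                // is freed only when the finalizer eventually runs, which under load can
                // look like a memory leak / out-of-memory condition.
                deflater.end();
            }
        }
    }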

2011/8/18 Sandy Pratt <pr...@adobe.com>:
> I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks, but I did note that the overall time of some MR jobs on our small cluster (~2B records at the time IIRC) went down slightly after the change.
>
> The primary reason I switched was not due to performance, however, but due to compression ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and released along with Hadoop, whereas LZO wasn't when I last used it (not an academic concern, it turned out).
>
> One speculation about where the discrepancy between micro-benchmarks and actual use may arise: do benchmarks include the cost of marshaling the data (64MB before compression region say) from disk?  If the benchmark starts with the data in memory (and how do you know if it does or not, given the layers of cache between you and the platters) then it might not reflect real world HBase scenarios.  GZ may need to read only 20MB while LZO might need to read 32MB.  Does that difference dominate the computational cost of decompression?
>
>
> Sandy
>
>
>> -----Original Message-----
>> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
>> Sent: Friday, July 29, 2011 08:44
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> For what's it worth I had similar observations.
>>
>> I simulated heavy write load and I found that NO compression was the
>> fastest, followed by GZ, followed by LZO.
>> After the tests I did a major_compact of the tables, and I included that time
>> in the total.
>> Also these tests where done with a single region server, in order to isolate
>> compression performance better.
>>
>>
>> So at least you're not the only one seeing this :) However, it seems that this
>> heavily depends on the details of your setup (relative CPU vs IO
>> performance, for example).
>>
>>
>> ----- Original Message -----
>> From: Steinmaurer Thomas <Th...@scch.at>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Thursday, July 28, 2011 11:27 PM
>> Subject: RE: GZ better than LZO?
>>
>> Hello,
>>
>> we simulated real looking data (as in our expected production system) in
>> respect to row-key, column families ...
>>
>> The test client (TDG) basically implement a three-part row key.
>>
>> vehicle-device-reversedtimestamp
>>
>> vehicle: 16 characters, left-padded with "0"
>> device: 16 characters, left-padded with "0"
>> reversedtimestamp: YYYYMMDDhhmmss
>>
>> There are four column families, although currently only one called
>> "data_details" is filled by the TDG. The others are reserved for later use.
>> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
>>
>> The qualifiers for "data_details" are basically based on an enum with 25
>> members. And each member has three occurrences, defined by adding a
>> different suffix to the qualifier name.
>>
>> Let's say, there is an enum member called "temperature1", then there are
>> the following qualifiers used:
>>
>> temperature1_value
>> temperature1_unit
>> temperature1_validity
>>
>> So, we end up with 25 * 3 = 75 qualifiers per row, filled with random values in
>> a range from [0, 65535] each.
>>
>> TDG basically allows to define the number of simulated clients (one thread
>> per client), enabled to run them in multi-threaded mode or in single-
>> threaded mode. Data volume is defined by number of iterations of the set of
>> simulated clients, the number of iterations per client, number of devices per
>> client and number of rows per device.
>>
>> After the test has finished, 1.008.000 rows were inserted and successfully
>> replicated to our backup test cluster.
>>
>> Any further ideas?
>>
>> PS: We are currently running a test with ~ 4mio rows following the pattern
>> above.
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> -----Original Message-----
>> From: Chiku [mailto:hakisenin@gmail.com]
>> Sent: Donnerstag, 28. Juli 2011 15:35
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> Are you getting this results because of the nature of test data generated?
>>
>> Would you mind sharing some details about the test client and the data it
>> generates?
>>
>>
>> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
>> Thomas.Steinmaurer@scch.at> wrote:
>>
>> > Hello,
>> >
>> >
>> >
>> > we ran a test client generating data into GZ and LZO compressed table.
>> > Equal data sets (number of rows: 1008000 and the same table schema). ~
>> > 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is
>>
>> > ~
>> > 444 MB, so basically half of LZO.
>> >
>> >
>> >
>> > Execution time of the data generating client was 1373 seconds into the
>>
>> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
>>
>> > generation client is based on HTablePool and using batch operations.
>> >
>> >
>> >
>> > So in our (simple) test, GZ beats LZO in both, disk usage and
>> > execution time of the client. We haven't tried reads yet.
>> >
>> >
>> >
>> > Is this an expected result? I thought LZO is the recommended
>> > compression algorithm? Or does LZO outperforms GZ with a growing
>> > amount of data or in read scenarios?
>> >
>> >
>> >
>> > Regards,
>> >
>> > Thomas
>> >
>> >
>> >
>> >
>
>



-- 
=============================
|     BlueDavy                                      |
|     http://www.bluedavy.com                |
=============================

RE: GZ better than LZO?

Posted by Sandy Pratt <pr...@adobe.com>.
I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks, but I did note that the overall time of some MR jobs on our small cluster (~2B records at the time IIRC) went down slightly after the change.

The primary reason I switched was not due to performance, however, but due to compression ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and released along with Hadoop, whereas LZO wasn't when I last used it (not an academic concern, it turned out).

One speculation about where the discrepancy between micro-benchmarks and actual use may arise: do benchmarks include the cost of marshaling the data (say, a 64 MB region before compression) from disk?  If the benchmark starts with the data in memory (and how do you know whether it does, given the layers of cache between you and the platters), then it might not reflect real-world HBase scenarios.  GZ may need to read only 20MB while LZO might need to read 32MB.  Does that difference dominate the computational cost of decompression?


Sandy


> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Friday, July 29, 2011 08:44
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> For what's it worth I had similar observations.
> 
> I simulated heavy write load and I found that NO compression was the
> fastest, followed by GZ, followed by LZO.
> After the tests I did a major_compact of the tables, and I included that time
> in the total.
> Also these tests where done with a single region server, in order to isolate
> compression performance better.
> 
> 
> So at least you're not the only one seeing this :) However, it seems that this
> heavily depends on the details of your setup (relative CPU vs IO
> performance, for example).
> 
> 
> ----- Original Message -----
> From: Steinmaurer Thomas <Th...@scch.at>
> To: user@hbase.apache.org
> Cc:
> Sent: Thursday, July 28, 2011 11:27 PM
> Subject: RE: GZ better than LZO?
> 
> Hello,
> 
> we simulated real looking data (as in our expected production system) in
> respect to row-key, column families ...
> 
> The test client (TDG) basically implement a three-part row key.
> 
> vehicle-device-reversedtimestamp
> 
> vehicle: 16 characters, left-padded with "0"
> device: 16 characters, left-padded with "0"
> reversedtimestamp: YYYYMMDDhhmmss
> 
> There are four column families, although currently only one called
> "data_details" is filled by the TDG. The others are reserved for later use.
> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
> 
> The qualifiers for "data_details" are basically based on an enum with 25
> members. And each member has three occurrences, defined by adding a
> different suffix to the qualifier name.
> 
> Let's say, there is an enum member called "temperature1", then there are
> the following qualifiers used:
> 
> temperature1_value
> temperature1_unit
> temperature1_validity
> 
> So, we end up with 25 * 3 = 75 qualifiers per row, filled with random values in
> a range from [0, 65535] each.
> 
> TDG basically allows to define the number of simulated clients (one thread
> per client), enabled to run them in multi-threaded mode or in single-
> threaded mode. Data volume is defined by number of iterations of the set of
> simulated clients, the number of iterations per client, number of devices per
> client and number of rows per device.
> 
> After the test has finished, 1.008.000 rows were inserted and successfully
> replicated to our backup test cluster.
> 
> Any further ideas?
> 
> PS: We are currently running a test with ~ 4mio rows following the pattern
> above.
> 
> Thanks,
> Thomas
> 
> 
> 
> -----Original Message-----
> From: Chiku [mailto:hakisenin@gmail.com]
> Sent: Donnerstag, 28. Juli 2011 15:35
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> Are you getting this results because of the nature of test data generated?
> 
> Would you mind sharing some details about the test client and the data it
> generates?
> 
> 
> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
> Thomas.Steinmaurer@scch.at> wrote:
> 
> > Hello,
> >
> >
> >
> > we ran a test client generating data into GZ and LZO compressed table.
> > Equal data sets (number of rows: 1008000 and the same table schema). ~
> > 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is
> 
> > ~
> > 444 MB, so basically half of LZO.
> >
> >
> >
> > Execution time of the data generating client was 1373 seconds into the
> 
> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
> 
> > generation client is based on HTablePool and using batch operations.
> >
> >
> >
> > So in our (simple) test, GZ beats LZO in both, disk usage and
> > execution time of the client. We haven't tried reads yet.
> >
> >
> >
> > Is this an expected result? I thought LZO is the recommended
> > compression algorithm? Or does LZO outperforms GZ with a growing
> > amount of data or in read scenarios?
> >
> >
> >
> > Regards,
> >
> > Thomas
> >
> >
> >
> >


Re: GZ better than LZO?

Posted by lars hofhansl <lh...@yahoo.com>.
For what it's worth, I had similar observations.

I simulated heavy write load and found that NO compression was the fastest, followed by GZ, followed by LZO.
After the tests I did a major_compact of the tables, and I included that time in the total.
Also, these tests were done with a single region server, in order to better isolate compression performance.
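
A minimal sketch of triggering that compaction step from the Java client, with a placeholder table name (the shell's major_compact does the same thing; the call is asynchronous, so the compaction time itself has to be observed on the cluster):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class MajorCompactSketch {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            // Request a major compaction of the table; the request returns immediately,
            // so the time spent compacting has to be measured on the region server side.
            admin.majorCompact("testdata_gz"); // placeholder table name
        }
    }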


So at least you're not the only one seeing this :)
However, it seems that this heavily depends on the details of your setup (relative CPU vs IO performance, for example).


----- Original Message -----
From: Steinmaurer Thomas <Th...@scch.at>
To: user@hbase.apache.org
Cc: 
Sent: Thursday, July 28, 2011 11:27 PM
Subject: RE: GZ better than LZO?

Hello,

we simulated real looking data (as in our expected production system) in
respect to row-key, column families ...

The test client (TDG) basically implement a three-part row key.

vehicle-device-reversedtimestamp

vehicle: 16 characters, left-padded with "0"
device: 16 characters, left-padded with "0"
reversedtimestamp: YYYYMMDDhhmmss

There are four column families, although currently only one called
"data_details" is filled by the TDG. The others are reserved for later
use. Replication (REPLICATION_SCOPE = 1) is enabled for all column
families.

The qualifiers for "data_details" are basically based on an enum with 25
members. And each member has three occurrences, defined by adding a
different suffix to the qualifier name.

Let's say, there is an enum member called "temperature1", then there are
the following qualifiers used:

temperature1_value
temperature1_unit
temperature1_validity

So, we end up with 25 * 3 = 75 qualifiers per row, filled with random
values in a range from [0, 65535] each.

TDG basically allows to define the number of simulated clients (one
thread per client), enabled to run them in multi-threaded mode or in
single-threaded mode. Data volume is defined by number of iterations of
the set of simulated clients, the number of iterations per client,
number of devices per client and number of rows per device.

After the test has finished, 1.008.000 rows were inserted and
successfully replicated to our backup test cluster.

Any further ideas?

PS: We are currently running a test with ~ 4mio rows following the
pattern above.

Thanks,
Thomas



-----Original Message-----
From: Chiku [mailto:hakisenin@gmail.com] 
Sent: Donnerstag, 28. Juli 2011 15:35
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

Are you getting this results because of the nature of test data
generated?

Would you mind sharing some details about the test client and the data
it generates?


On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
Thomas.Steinmaurer@scch.at> wrote:

> Hello,
>
>
>
> we ran a test client generating data into GZ and LZO compressed table.
> Equal data sets (number of rows: 1008000 and the same table schema). ~
> 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is

> ~
> 444 MB, so basically half of LZO.
>
>
>
> Execution time of the data generating client was 1373 seconds into the

> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data

> generation client is based on HTablePool and using batch operations.
>
>
>
> So in our (simple) test, GZ beats LZO in both, disk usage and 
> execution time of the client. We haven't tried reads yet.
>
>
>
> Is this an expected result? I thought LZO is the recommended 
> compression algorithm? Or does LZO outperforms GZ with a growing 
> amount of data or in read scenarios?
>
>
>
> Regards,
>
> Thomas
>
>
>
>


RE: GZ better than LZO?

Posted by Steinmaurer Thomas <Th...@scch.at>.
Hello,

we simulated realistic-looking data (as in our expected production system)
with respect to row key, column families, ...

The test client (TDG) basically implements a three-part row key (a rough
sketch follows below).

vehicle-device-reversedtimestamp

vehicle: 16 characters, left-padded with "0"
device: 16 characters, left-padded with "0"
reversedtimestamp: YYYYMMDDhhmmss
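
A rough sketch of how such a key can be assembled (the exact way the
timestamp is reversed is an implementation detail; this just shows one
possibility, with fixed-width parts and no delimiter):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class RowKeySketch {
        public static String rowKey(long vehicleId, long deviceId, Date eventTime) {
            String vehicle = String.format("%016d", vehicleId); // 16 chars, left-padded with "0"
            String device = String.format("%016d", deviceId);   // 16 chars, left-padded with "0"
            // One way to "reverse" a YYYYMMDDhhmmss timestamp so that newer rows sort
            // first: subtract it from the largest 14-digit value.
            long ts = Long.parseLong(new SimpleDateFormat("yyyyMMddHHmmss").format(eventTime));
            String reversedTimestamp = String.format("%014d", 99999999999999L - ts);
            return vehicle + device + reversedTimestamp;
        }
    }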

There are four column families, although currently only one called
"data_details" is filled by the TDG. The others are reserved for later
use. Replication (REPLICATION_SCOPE = 1) is enabled for all column
families.

The qualifiers for "data_details" are basically based on an enum with 25
members, and each member occurs three times, distinguished by a different
suffix on the qualifier name.

Let's say there is an enum member called "temperature1"; then the
following qualifiers are used:

temperature1_value
temperature1_unit
temperature1_validity

So, we end up with 25 * 3 = 75 qualifiers per row, filled with random
values in a range from [0, 65535] each.

TDG basically allows defining the number of simulated clients (one
thread per client) and running them in either multi-threaded or
single-threaded mode. Data volume is defined by the number of iterations
of the set of simulated clients, the number of iterations per client,
the number of devices per client and the number of rows per device.

After the test finished, 1,008,000 rows had been inserted and
successfully replicated to our backup test cluster.

Any further ideas?

PS: We are currently running a test with ~4 million rows following the
pattern above.

Thanks,
Thomas



-----Original Message-----
From: Chiku [mailto:hakisenin@gmail.com] 
Sent: Donnerstag, 28. Juli 2011 15:35
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

Are you getting this results because of the nature of test data
generated?

Would you mind sharing some details about the test client and the data
it generates?


On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
Thomas.Steinmaurer@scch.at> wrote:

> Hello,
>
>
>
> we ran a test client generating data into GZ and LZO compressed table.
> Equal data sets (number of rows: 1008000 and the same table schema). ~
> 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is

> ~
> 444 MB, so basically half of LZO.
>
>
>
> Execution time of the data generating client was 1373 seconds into the

> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data

> generation client is based on HTablePool and using batch operations.
>
>
>
> So in our (simple) test, GZ beats LZO in both, disk usage and 
> execution time of the client. We haven't tried reads yet.
>
>
>
> Is this an expected result? I thought LZO is the recommended 
> compression algorithm? Or does LZO outperforms GZ with a growing 
> amount of data or in read scenarios?
>
>
>
> Regards,
>
> Thomas
>
>
>
>

Re: GZ better than LZO?

Posted by Chiku <ha...@gmail.com>.
Are you getting these results because of the nature of the generated test data?

Would you mind sharing some details about the test client and the data it
generates?


On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
Thomas.Steinmaurer@scch.at> wrote:

> Hello,
>
>
>
> we ran a test client generating data into GZ and LZO compressed table.
> Equal data sets (number of rows: 1008000 and the same table schema). ~
> 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is ~
> 444 MB, so basically half of LZO.
>
>
>
> Execution time of the data generating client was 1373 seconds into the
> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
> generation client is based on HTablePool and using batch operations.
>
>
>
> So in our (simple) test, GZ beats LZO in both, disk usage and execution
> time of the client. We haven't tried reads yet.
>
>
>
> Is this an expected result? I thought LZO is the recommended compression
> algorithm? Or does LZO outperforms GZ with a growing amount of data or
> in read scenarios?
>
>
>
> Regards,
>
> Thomas
>
>
>
>

Re: GZ better than LZO?

Posted by MrBSD <Kd...@hotmail.com>.
As a small comment,
there have been several comparison attempts for fast compression algorithms
lately, and LZ4 has been mentioned several times as coming out on top as far
as speed is concerned.
(as an example : http://reboot.pro/15285/)

The source code is BSD-licensed and available on Google Code:
http://code.google.com/p/lz4/




Re: GZ better than LZO?

Posted by Chris Tarnas <cf...@email.com>.
Your region distribution across the nodes is not great; in both cases most of your data is going to one server, and spreading the regions out across multiple servers would be best.

How many different vehicle_ids are being used, and are they all sequential integers in your tests? HBase performs better when not doing sequential inserts. You could try reversing the vehicle ids to get around that (see the many discussions on the list about using reversed timestamps as a rowkey).

Looking at your key construction, I would suggest, unless your app requires it, not left-padding your ids with zeros and instead using a delimiter between the key components. That will lead to smaller keys; if you use a tab as your delimiter, that character sorts before all other alphanumeric and punctuation characters (other than LF, CR, etc., which should not be in your IDs anyway), so the keys will sort the same as left-padded ones.

I've had good luck with converting sequential numeric IDs to base 64 and then reversing them; that leads to very good key distribution across regions and shorter keys for any given number. Another option, if you don't care whether your rowkeys are plaintext, is to convert the IDs to binary numbers and then reverse the bytes; that would be the most compact. If you do that you would go back to not using delimiters and just have fixed offsets for each component (see the sketch below).
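
A sketch of the binary, reversed-byte variant with hypothetical component names:

    import org.apache.hadoop.hbase.util.Bytes;

    public class ReversedIdKeySketch {
        // Fixed offsets: 8 bytes reversed vehicle id, 8 bytes reversed device id,
        // 8 bytes (reversed) timestamp - no delimiters needed.
        public static byte[] rowKey(long vehicleId, long deviceId, long reversedTimestamp) {
            byte[] vehicle = reverse(Bytes.toBytes(vehicleId));
            byte[] device = reverse(Bytes.toBytes(deviceId));
            return Bytes.add(Bytes.add(vehicle, device), Bytes.toBytes(reversedTimestamp));
        }

        // Reversing the bytes of a sequential id spreads consecutive ids across the
        // key space, so inserts hit many regions instead of always the last one.
        private static byte[] reverse(byte[] in) {
            byte[] out = new byte[in.length];
            for (int i = 0; i < in.length; i++) {
                out[i] = in[in.length - 1 - i];
            }
            return out;
        }
    }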

Once you have a rowkey design, you can then go ahead and create your tables pre-split with multiple empty regions. That should perform much better overall for inserts, especially when the DB is new and empty to start.

How did the load with 4 million records perform?

-chris

On Jul 29, 2011, at 12:36 AM, Steinmaurer Thomas wrote:

> Hi Chris!
> 
> Your questions are somehow hard to answer for me, because I'm not really
> in charge for the test cluster from an administration/setup POV.
> 
> Basically, when running:
> http://xxx:60010/master.jsp
> 
> I see 7 region servers. Each with a "maxHeap" value of 995.
> 
> When clicking on the different tables depending on the compression type,
> I get the following information:
> 
> GZ compressed table: 3 regions hosted by one region server
> LZO compressed table: 8 regions hosted by two region servers, where the
> start region is hosted by one region server and all other 7 regions are
> hosted on the second region server
> 
> Regarding the insert pattern etc... please have a look on my reply to
> Chiku, where I describe the test data generator and the table layout etc
> ... a bit.
> 
> Thanks,
> Thomas
> 
> -----Original Message-----
> From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris
> Tarnas
> Sent: Donnerstag, 28. Juli 2011 19:43
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> During the load did you add enough data to do a flush or compaction? P,
> In our cluster that amount of data inserted would not necessarily be
> enough to actually flush store files. Performance really depends on how
> the table's regions are laid out, the insert pattern, the number of
> regionservers and the amount of RAM allocated to each regionserver. If
> you don't see any flushes or compactions in the log try repeating that
> test and then flushing the table and do a compaction (or add more data
> so it happens automatically) and timing everything. It would be
> interesting to see if the GZ benefit holds up.
> 
> -chris
> 
> On Jul 28, 2011, at 6:31 AM, Steinmaurer Thomas wrote:
> 
>> Hello,
>> 
>> 
>> 
>> we ran a test client generating data into GZ and LZO compressed table.
>> Equal data sets (number of rows: 1008000 and the same table schema). ~
>> 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is
> 
>> ~
>> 444 MB, so basically half of LZO.
>> 
>> 
>> 
>> Execution time of the data generating client was 1373 seconds into the
> 
>> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
> 
>> generation client is based on HTablePool and using batch operations.
>> 
>> 
>> 
>> So in our (simple) test, GZ beats LZO in both, disk usage and 
>> execution time of the client. We haven't tried reads yet.
>> 
>> 
>> 
>> Is this an expected result? I thought LZO is the recommended 
>> compression algorithm? Or does LZO outperforms GZ with a growing 
>> amount of data or in read scenarios?
>> 
>> 
>> 
>> Regards,
>> 
>> Thomas
>> 
>> 
>> 
> 


RE: GZ better than LZO?

Posted by Steinmaurer Thomas <Th...@scch.at>.
Hi Chris!

Your questions are somewhat hard for me to answer, because I'm not really
in charge of the test cluster from an administration/setup POV.

Basically, when running:
http://xxx:60010/master.jsp

I see 7 region servers. Each with a "maxHeap" value of 995.

When clicking on the different tables depending on the compression type,
I get the following information:

GZ compressed table: 3 regions hosted by one region server
LZO compressed table: 8 regions hosted by two region servers, where the
start region is hosted by one region server and all other 7 regions are
hosted on the second region server

Regarding the insert pattern etc., please have a look at my reply to
Chiku, where I describe the test data generator and the table layout
a bit.

Thanks,
Thomas

-----Original Message-----
From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris
Tarnas
Sent: Donnerstag, 28. Juli 2011 19:43
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

During the load did you add enough data to do a flush or compaction? P,
In our cluster that amount of data inserted would not necessarily be
enough to actually flush store files. Performance really depends on how
the table's regions are laid out, the insert pattern, the number of
regionservers and the amount of RAM allocated to each regionserver. If
you don't see any flushes or compactions in the log try repeating that
test and then flushing the table and do a compaction (or add more data
so it happens automatically) and timing everything. It would be
interesting to see if the GZ benefit holds up.

-chris

On Jul 28, 2011, at 6:31 AM, Steinmaurer Thomas wrote:

> Hello,
> 
> 
> 
> we ran a test client generating data into GZ and LZO compressed table.
> Equal data sets (number of rows: 1008000 and the same table schema). ~
> 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is

> ~
> 444 MB, so basically half of LZO.
> 
> 
> 
> Execution time of the data generating client was 1373 seconds into the

> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data

> generation client is based on HTablePool and using batch operations.
> 
> 
> 
> So in our (simple) test, GZ beats LZO in both, disk usage and 
> execution time of the client. We haven't tried reads yet.
> 
> 
> 
> Is this an expected result? I thought LZO is the recommended 
> compression algorithm? Or does LZO outperforms GZ with a growing 
> amount of data or in read scenarios?
> 
> 
> 
> Regards,
> 
> Thomas
> 
> 
> 


Re: GZ better than LZO?

Posted by Chris Tarnas <cf...@email.com>.
During the load did you add enough data to trigger a flush or compaction? In our cluster, that amount of data inserted would not necessarily be enough to actually flush store files. Performance really depends on how the table's regions are laid out, the insert pattern, the number of regionservers and the amount of RAM allocated to each regionserver. If you don't see any flushes or compactions in the log, try repeating that test, then flushing the table and doing a compaction (or adding more data so it happens automatically), and time everything. It would be interesting to see if the GZ benefit holds up.

-chris

On Jul 28, 2011, at 6:31 AM, Steinmaurer Thomas wrote:

> Hello,
> 
> 
> 
> we ran a test client generating data into GZ and LZO compressed table.
> Equal data sets (number of rows: 1008000 and the same table schema). ~
> 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is ~
> 444 MB, so basically half of LZO.
> 
> 
> 
> Execution time of the data generating client was 1373 seconds into the
> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
> generation client is based on HTablePool and using batch operations.
> 
> 
> 
> So in our (simple) test, GZ beats LZO in both, disk usage and execution
> time of the client. We haven't tried reads yet.
> 
> 
> 
> Is this an expected result? I thought LZO is the recommended compression
> algorithm? Or does LZO outperforms GZ with a growing amount of data or
> in read scenarios?
> 
> 
> 
> Regards,
> 
> Thomas
> 
> 
>