Posted to user@hbase.apache.org by Sandy Pratt <pr...@adobe.com> on 2011/08/17 20:54:29 UTC

RE: GZ better than LZO?

I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks, but I did note that the overall time of some MR jobs on our small cluster (~2B records at the time IIRC) went down slightly after the change.

The primary reason I switched was not due to performance, however, but due to compression ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and released along with Hadoop, whereas LZO wasn't when I last used it (not an academic concern, it turned out).

One speculation about where the discrepancy between micro-benchmarks and actual use may arise: do benchmarks include the cost of marshaling the data (say, a 64MB region before compression) from disk?  If the benchmark starts with the data in memory (and how do you know whether it does, given the layers of cache between you and the platters), then it might not reflect real-world HBase scenarios.  GZ may need to read only 20MB while LZO might need to read 32MB.  Does that difference dominate the computational cost of decompression?


Sandy


> -----Original Message-----
> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> Sent: Friday, July 29, 2011 08:44
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> For what it's worth, I had similar observations.
> 
> I simulated a heavy write load and found that NO compression was the
> fastest, followed by GZ, followed by LZO.
> After the tests I did a major_compact of the tables, and I included that time
> in the total.
> Also, these tests were done with a single region server, in order to better
> isolate compression performance.
> 
> 
> So at least you're not the only one seeing this :) However, it seems that this
> heavily depends on the details of your setup (relative CPU vs IO
> performance, for example).
> 
> 
> ----- Original Message -----
> From: Steinmaurer Thomas <Th...@scch.at>
> To: user@hbase.apache.org
> Cc:
> Sent: Thursday, July 28, 2011 11:27 PM
> Subject: RE: GZ better than LZO?
> 
> Hello,
> 
> we simulated realistic-looking data (as in our expected production system) with
> respect to row key, column families and so on.
> 
> The test client (TDG) basically implements a three-part row key.
> 
> vehicle-device-reversedtimestamp
> 
> vehicle: 16 characters, left-padded with "0"
> device: 16 characters, left-padded with "0"
> reversedtimestamp: YYYYMMDDhhmmss
> 
> There are four column families, although currently only one called
> "data_details" is filled by the TDG. The others are reserved for later use.
> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
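
As an aside, a minimal sketch of how a table along these lines could be created with the 0.90-era Java admin API. Only "data_details", the compression choice and REPLICATION_SCOPE come from the thread; the table name is a hypothetical placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateVehicleTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor table = new HTableDescriptor("vehicle_data"); // hypothetical name
            HColumnDescriptor cf = new HColumnDescriptor("data_details");
            cf.setCompressionType(Compression.Algorithm.GZ); // or LZO, if that codec is installed
            cf.setScope(1);                                  // REPLICATION_SCOPE = 1
            table.addFamily(cf);
            // ... the three reserved column families would be added here the same way ...
            admin.createTable(table);
        }
    }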
> 
> The qualifiers for "data_details" are basically based on an enum with 25
> members. And each member has three occurrences, defined by adding a
> different suffix to the qualifier name.
> 
> Let's say there is an enum member called "temperature1"; then the following
> qualifiers are used:
> 
> temperature1_value
> temperature1_unit
> temperature1_validity
> 
> So we end up with 25 * 3 = 75 qualifiers per row, each filled with a random
> value in the range [0, 65535].
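
To make the write pattern concrete, a purely illustrative sketch of how such a row key and Put might be assembled with the client API of that era. The separator, the member names and the string encoding of values are assumptions; TDG's actual code is not shown in the thread:

    import java.util.Random;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowSketch {
        // Left-pad to 16 characters with '0', as described above.
        static String pad16(String s) {
            StringBuilder sb = new StringBuilder();
            for (int i = s.length(); i < 16; i++) sb.append('0');
            return sb.append(s).toString();
        }

        static Put buildRow(String vehicle, String device, String reversedTimestamp) {
            // vehicle-device-reversedtimestamp, each part padded as described.
            String rowKey = pad16(vehicle) + "-" + pad16(device) + "-" + reversedTimestamp;
            Put put = new Put(Bytes.toBytes(rowKey));
            Random rnd = new Random();
            byte[] family = Bytes.toBytes("data_details");
            String[] suffixes = {"_value", "_unit", "_validity"};
            // 25 enum members x 3 suffixes = 75 qualifiers, random values in [0, 65535].
            for (int member = 1; member <= 25; member++) {
                for (String suffix : suffixes) {
                    String qualifier = "measurement" + member + suffix; // hypothetical member names
                    put.add(family, Bytes.toBytes(qualifier),
                            Bytes.toBytes(Integer.toString(rnd.nextInt(65536))));
                }
            }
            return put;
        }
    }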
> 
> TDG basically allows defining the number of simulated clients (one thread
> per client), which can be run in multi-threaded or single-threaded mode.
> Data volume is defined by the number of iterations of the set of simulated
> clients, the number of iterations per client, the number of devices per client
> and the number of rows per device.
> 
> After the test finished, 1,008,000 rows had been inserted and successfully
> replicated to our backup test cluster.
> 
> Any further ideas?
> 
> PS: We are currently running a test with ~4 million rows following the pattern
> above.
> 
> Thanks,
> Thomas
> 
> 
> 
> -----Original Message-----
> From: Chiku [mailto:hakisenin@gmail.com]
> Sent: Donnerstag, 28. Juli 2011 15:35
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> Are you getting these results because of the nature of the test data generated?
> 
> Would you mind sharing some details about the test client and the data it
> generates?
> 
> 
> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
> Thomas.Steinmaurer@scch.at> wrote:
> 
> > Hello,
> >
> >
> >
> > we ran a test client generating data into GZ- and LZO-compressed tables.
> > Equal data sets (number of rows: 1,008,000, and the same table schema),
> > ~7.78 GB of disk space uncompressed in HDFS. LZO is ~887 MB whereas GZ
> > is ~444 MB, so basically half of LZO.
> >
> >
> >
> > Execution time of the data-generating client was 1373 seconds into the
> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
> > generation client is based on HTablePool and uses batch operations.
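
A minimal sketch of the HTablePool-plus-batch pattern mentioned here, using the client API of that era; the table name and pool size are placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;
    import org.apache.hadoop.hbase.client.Put;

    public class BatchWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTablePool pool = new HTablePool(conf, 10);              // pool size is arbitrary here
            HTableInterface table = pool.getTable("vehicle_data");   // hypothetical table name
            try {
                List<Put> batch = new ArrayList<Put>();
                // ... fill the batch, e.g. with Puts like the row sketch above ...
                table.put(batch);        // one batched call instead of one RPC per row
                table.flushCommits();
            } finally {
                pool.putTable(table);    // return the table to the pool
            }
        }
    }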
> >
> >
> >
> > So in our (simple) test, GZ beats LZO in both disk usage and
> > execution time of the client. We haven't tried reads yet.
> >
> >
> >
> > Is this an expected result? I thought LZO was the recommended
> > compression algorithm? Or does LZO outperform GZ with a growing
> > amount of data or in read scenarios?
> >
> >
> >
> > Regards,
> >
> > Thomas
> >
> >
> >
> >


RE: GZ better than LZO?

Posted by Steinmaurer Thomas <Th...@scch.at>.
Ah, sorry. 550 million rows, not billions.

Thomas

-----Original Message-----
From: Steinmaurer Thomas [mailto:Thomas.Steinmaurer@scch.at] 
Sent: Donnerstag, 18. August 2011 14:55
To: user@hbase.apache.org
Subject: RE: GZ better than LZO?

After our tests with ~550 billion rows, we will probably go with Snappy. Our test showed better write performance compared to GZ and LZO, with only slightly more disk usage compared to LZO.

Haven't looked at comparing read performance for our pattern, but performance of Snappy should be sufficient here as well.

Regards,
Thomas


RE: GZ better than LZO?

Posted by Steinmaurer Thomas <Th...@scch.at>.
After our tests with ~550 billion rows, we will probably go with Snappy. Our test showed better write performance compared to GZ and LZO, with only slightly more disk usage compared to LZO.

Haven't looked at comparing read performance for our pattern, but performance of Snappy should be sufficient here as well.
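
For completeness, switching the family to Snappy is just a different algorithm constant on the column descriptor, assuming an HBase/Hadoop build that ships the native snappy codec (not all 0.90 releases did):

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class SnappyFamily {
        static HColumnDescriptor dataDetails() {
            HColumnDescriptor cf = new HColumnDescriptor("data_details");
            cf.setCompressionType(Compression.Algorithm.SNAPPY);
            cf.setScope(1); // keep REPLICATION_SCOPE = 1 as before
            return cf;      // add to an HTableDescriptor and create/alter the table as usual
        }
    }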

Regards,
Thomas

-----Original Message-----
From: BlueDavy Lin [mailto:bluedavy@gmail.com] 
Sent: Donnerstag, 18. August 2011 04:06
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

We tested gz as well, but when we use gz, it seems to cause out-of-memory problems.

It seems this may be because the gz codec does not use Deflater/Inflater correctly (the end method is not called explicitly).


RE: GZ better than LZO?

Posted by "Srikanth P. Shreenivas" <Sr...@mindtree.com>.
If it helps, I tried the LZO setup on CDH3 on my Ubuntu box.
I have documented the steps here; they should work fine for others too.

http://www.srikanthps.com/2011/08/configuring-lzo-compression-for-cdh3.html

Regards,
Srikanth


________________________________________
From: Sandy Pratt [prattrs@adobe.com]
Sent: Friday, August 19, 2011 12:21 AM
To: user@hbase.apache.org
Subject: RE: GZ better than LZO?

You're definitely going to want to use the native libraries for zlib and gzip.

http://hadoop.apache.org/common/docs/current/native_libraries.html

It's actually a fairly easy build, and it comes out of the box with CDH IIRC.  You can put a symlink to hadoop/lib/native in hbase/lib and you're done.

When HBase falls back to Java for GZ and zlib, it will definitely be a bad thing =/

Sandy



RE: GZ better than LZO?

Posted by Sandy Pratt <pr...@adobe.com>.
You're definitely going to want to use the native libraries for zlib and gzip.

http://hadoop.apache.org/common/docs/current/native_libraries.html

It's actually a fairly easy build, and it comes out of the box with CDH IIRC.  You can put a symlink to hadoop/lib/native in hbase/lib and you're done.

When HBase falls back to Java for GZ and zlib, it will definitely be a bad thing =/
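
One way to verify that the native bindings are actually being picked up, run with the same classpath and java.library.path as the region server (these Hadoop utility classes have been around since the 0.20 line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.zlib.ZlibFactory;
    import org.apache.hadoop.util.NativeCodeLoader;

    public class NativeCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // True only if libhadoop was found on java.library.path.
            System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
            // True only if the native zlib bindings are usable (otherwise GZ falls back to Java).
            System.out.println("native zlib loaded:   " + ZlibFactory.isNativeZlibLoaded(conf));
        }
    }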

Sandy


> -----Original Message-----
> From: BlueDavy Lin [mailto:bluedavy@gmail.com]
> Sent: Wednesday, August 17, 2011 19:07
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> We tested gz as well, but when we use gz, it seems to cause out-of-memory
> problems.
> 
> It seems this may be because the gz codec does not use Deflater/Inflater
> correctly (the end method is not called explicitly).
> 

Re: GZ better than LZO?

Posted by BlueDavy Lin <bl...@gmail.com>.
We tested gz as well, but when we use gz, it seems to cause out-of-memory problems.

It seems this may be because the gz codec does not use Deflater/Inflater
correctly (the end method is not called explicitly).
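
To illustrate the suspected problem in isolation (plain java.util.zip, not HBase's actual codec code): Deflater/Inflater hold native zlib buffers, and if end() is never called explicitly they are only released when the finalizer eventually runs, which under heavy load can look like running out of memory.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    public class DeflateSketch {
        static byte[] compress(byte[] data) {
            Deflater deflater = new Deflater();
            try {
                deflater.setInput(data);
                deflater.finish();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                while (!deflater.finished()) {
                    int n = deflater.deflate(buf);
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } finally {
                deflater.end(); // release the off-heap zlib memory deterministically
            }
        }
    }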

2011/8/18 Sandy Pratt <pr...@adobe.com>:
> I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks, but I did note that the overall time of some MR jobs on our small cluster (~2B records at the time IIRC) went down slightly after the change.
>
> The primary reason I switched was not due to performance, however, but due to compression ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and released along with Hadoop, whereas LZO wasn't when I last used it (not an academic concern, it turned out).
>
> One speculation about where the discrepancy between micro-benchmarks and actual use may arise: do benchmarks include the cost of marshaling the data (say, a 64MB region before compression) from disk?  If the benchmark starts with the data in memory (and how do you know whether it does, given the layers of cache between you and the platters), then it might not reflect real-world HBase scenarios.  GZ may need to read only 20MB while LZO might need to read 32MB.  Does that difference dominate the computational cost of decompression?
>
>
> Sandy
>
>



-- 
=============================
|     BlueDavy                                      |
|     http://www.bluedavy.com                |
=============================