Posted to common-issues@hadoop.apache.org by "German Florez-Larrahondo (JIRA)" <ji...@apache.org> on 2013/07/29 18:25:49 UTC

[jira] [Created] (HADOOP-9785) LZ4 code may need upgrade (lz4.c embedded in libHadoop is r43 18 months ago, while latest version is r98)

German Florez-Larrahondo created HADOOP-9785:
------------------------------------------------

             Summary: LZ4 code may need upgrade (lz4.c embedded in libHadoop is r43 18 months ago, while latest version is r98)
                 Key: HADOOP-9785
                 URL: https://issues.apache.org/jira/browse/HADOOP-9785
             Project: Hadoop Common
          Issue Type: Improvement
          Components: io, native
    Affects Versions: 2.0.4-alpha, 3.0.0
         Environment: [german@localhost lz4-read-only]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              10
CPU MHz:               2667.000
BogoMIPS:              5319.82
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              2048K
NUMA node0 CPU(s):     0-3

[german@localhost lz4-read-only]$ uname -r
2.6.32-358.14.1.el6.x86_64

            Reporter: German Florez-Larrahondo
            Priority: Minor
             Fix For: 3.0.0, 2.0.4-alpha


While analyzing compression performance of different Hadoop codecs I noticed that the LZ4 code was taken from revision 43 of https://code.google.com/p/lz4/. The latest version is r98 and there may be extra performance benefits we can gain from using r98. 

We may want to involve the original LZ4 author, Yann Collet, in these discussions, as the current LZ4 code includes additional algorithms and parameters.

To start the investigation, I ran preliminary experiments with the Silesia corpus. Compression and decompression throughput both appear to improve in the latest release compared with r43. I have not done enough analysis to conclude anything statistically, but the results look promising.

Here is the raw output using LZ4 from r43 with a SUBSET of the Silesia corpus (http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia):

File: silesia/dickens
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 10192446 bytes into 6433123 bytes ==> 63.12%
Done in 0.07 s ==> 138.86 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 10192446 bytes
Done in 0.02 s ==> 486.01 MB/s

File: silesia/mozilla
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 51220480 bytes into 26379814 bytes ==> 51.50%
Done in 0.25 s ==> 195.39 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 51220480 bytes
Done in 0.12 s ==> 407.06 MB/s

File: silesia/mr
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 9970564 bytes into 5669268 bytes ==> 56.86%
Done in 0.04 s ==> 237.72 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 9970564 bytes
Done in 0.02 s ==> 475.43 MB/s

File: silesia/nci
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Compressed 33553445 bytes into 5880292 bytes ==> 17.53%
Done in 0.08 s ==> 399.99 MB/s
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
Successfully decoded 33553445 bytes
Done in 0.06 s ==> 533.32 MB/s
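
As a sanity check on the r43 numbers above, the ratio and throughput columns can be recomputed from the byte counts and elapsed times that the CLI prints. The sketch below assumes (this is inferred from the figures, not stated by the tool) that "MB" means 2^20 bytes and that the elapsed time is used exactly as printed, to two decimals. The function and variable names are mine, for illustration only.

```python
# Recompute the r43 CLI figures from the raw numbers it prints.
# Assumption: 1 MB = 2**20 bytes, elapsed time taken as printed.

MIB = 2 ** 20


def ratio_percent(original_bytes, compressed_bytes):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_bytes / original_bytes


def throughput_mb_s(original_bytes, seconds):
    """Uncompressed bytes processed per second, in MB/s (MB = 2**20 bytes)."""
    return original_bytes / seconds / MIB


# (file, original bytes, compressed bytes, compress seconds, decompress seconds)
R43_RUNS = [
    ("dickens", 10192446, 6433123, 0.07, 0.02),
    ("mozilla", 51220480, 26379814, 0.25, 0.12),
    ("mr", 9970564, 5669268, 0.04, 0.02),
    ("nci", 33553445, 5880292, 0.08, 0.06),
]

for name, orig, comp, c_s, d_s in R43_RUNS:
    print(f"{name:8s} ratio {ratio_percent(orig, comp):6.2f}%  "
          f"compress {throughput_mb_s(orig, c_s):7.2f} MB/s  "
          f"decompress {throughput_mb_s(orig, d_s):7.2f} MB/s")
```

The recomputed values agree with the printed ones to two decimals, which suggests the r43 throughput is derived from a time value with only 10 ms granularity. If that is the case, a 0.02 s decode carries a large relative uncertainty, so the r43 decompression figures should be read as coarse estimates.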

And here is the raw output of LZ4 from the latest release, r98:

File: silesia/dickens
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/dickens...
1-LZ4_compress        :  10192446 ->   6434313 (63.13%),  172.3 MB/s
LZ4_decompress_fast   :  10192446 ->   676.0 MB/s

File: silesia/mozilla
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/mozilla...
1-LZ4_compress        :  51220480 ->  26382113 (51.51%),  281.7 MB/s
LZ4_decompress_fast   :  51220480 ->  1003.1 MB/s

File: silesia/mr
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/mr...
1-LZ4_compress        :   9970564 ->   5669255 (56.86%),  268.3 MB/s
LZ4_decompress_fast   :   9970564 ->   788.7 MB/s

File: silesia/nci
*** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
Loading silesia/nci...
1-LZ4_compress        :  33553445 ->   5883923 (17.54%),  584.9 MB/s
LZ4_decompress_fast   :  33553445 ->  1208.3 MB/s
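
To put the two sets of figures side by side, the sketch below divides the r98 throughput by the r43 throughput for each file. The MB/s values are copied from the raw output above; the dictionary layout and the function name are mine. Note that the two runs use different harnesses (the r43 compression CLI versus the r98 speed analyzer), so the ratios are indicative only and say nothing about statistical significance.

```python
# Throughput in MB/s copied from the raw output: (compress, decompress).
R43 = {
    "dickens": (138.86, 486.01),
    "mozilla": (195.39, 407.06),
    "mr": (237.72, 475.43),
    "nci": (399.99, 533.32),
}
R98 = {
    "dickens": (172.3, 676.0),
    "mozilla": (281.7, 1003.1),
    "mr": (268.3, 788.7),
    "nci": (584.9, 1208.3),
}


def speedups(old, new):
    """Return {file: (compress ratio, decompress ratio)} of new over old."""
    return {
        name: (new[name][0] / old[name][0], new[name][1] / old[name][1])
        for name in old
    }


for name, (comp, decomp) in speedups(R43, R98).items():
    print(f"{name:8s} compress x{comp:.2f}  decompress x{decomp:.2f}")
```

On these four files every ratio comes out above 1, with decompression showing the larger gains.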

