You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-issues@hadoop.apache.org by "Nicholas Carlini (JIRA)" <ji...@apache.org> on 2010/06/23 23:56:49 UTC

[jira] Created: (HADOOP-6837) Support for LZMA compression

Support for LZMA compression
----------------------------

                 Key: HADOOP-6837
                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
             Project: Hadoop Common
          Issue Type: Improvement
          Components: io
            Reporter: Nicholas Carlini
         Attachments: HADOOP-6837-lzma-java-20100623.patch

Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Jakob Homan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897400#action_12897400 ] 

Jakob Homan commented on HADOOP-6837:
-------------------------------------

It may be good to address the code style issue now, since this patch diverges significantly from our standard: http://wiki.apache.org/hadoop/CodeReviewChecklist  Eclipse can re-format everything into Hadoop's style pretty well, if that will save time.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment: HADOOP-6837-lzma-1-20100722.patch

Attached a patch merging the java and c code together. Fixed two bugs in the C code, and removed some unneeded methods/arguments. Ran TestCodec on both of them for several hours each, and didn't find any bugs. Tested (again with TestCodec) all possible dictionary sizes from 1<<12 to 1<<23. Also added more (much needed) documentation.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs updated HADOOP-6837:
---------------------------------

    Attachment: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch

Hand-edited pseudo-patch for easier reviewing.

"New" files from the LZMA SDK were diffed against that (version 9.12), and patch-hunks containing only trivial changes (Java package, formatting, Apache boilerplate, white space, etc.) were replaced with a corresponding meta-comment.  Note that the 9.12 diffs were "diff -ruw" (i.e., ignore whitespace-only changes on single lines) and only one directory level above the relevant code ("LZMA-SDK" vs. "LZMA").  The Java and C diffs were then combined, and the truly new files (e.g., LzmaCodec.java) were then added back from the original patch.

3700 lines vs. original 10400+

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915938#action_12915938 ] 

Joydeep Sen Sarma commented on HADOOP-6837:
-------------------------------------------

thanks to everyone on getting lzma into hadoop. it seems to be very promising.

i have tried applying the latest patch to both hadoop-0.20 (yahoo/facebook branch) and common- trunk. in both cases - when i try running TestCodec after compiling the native codec - i get a sigsegv:

    [junit] Running org.apache.hadoop.io.compress.TestCodec
    [junit] #
    [junit] # An unexpected error has been detected by Java Runtime Environment:
    [junit] #
    [junit] #  SIGSEGV (0xb) at pc=0x00002aaad5215659, pid=16028, tid=1076017472
    [junit] #
    [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode linux-amd64)
    [junit] # Problematic frame:
    [junit] # C  [libhadoop.so.1.0.0+0x5659]  thisRead+0x49
    [junit] #

separate from this - i had a question about tuning the compression level. in my testing on internal data using the lzma utility built from the SDK - i found a bunch of interesting option that provided a more suitable compromise between compression ratio/cpu (-a0 -mfhc4 -d24 -fbxxx) than the default. eyeing the 'level' based normalization - it seems i won't be able to quite achieve the settings i want by specifying a level. so it seems that being able to configure these options separately would be very useful.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896214#action_12896214 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

On second thought, _if_ you put the version into the src/contrib path as I suggested, there's no need to add it to the boilerplate text, too.  That will make future forward-ports simpler (i.e., they can use the same boilerplate text).

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

          Status: Patch Available  (was: Open)
    Hadoop Flags: [Reviewed]

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881929#action_12881929 ] 

Eli Collins commented on HADOOP-6837:
-------------------------------------

Hey Nicholas,

Cool stuff.  Unfortunately LGPL in incompatible with APL so we couldn't check this in. See more at http://www.apache.org/legal/resolved.html

bq. The LGPL is ineligible primarily due to the restrictions it places on larger works, violating the third license criterion. Therefore, LGPL-licensed works must not be included in Apache products

Do you need to use this particular codec or are you just looking for something better than gzip/bzip2? If the latter HADOOP-6349 (support for FastLZ) would be a great place to direct your efforts, it's got a compatible license and like LZMA is significantly faster than gzip/bzip (and faster than the open source version of lzo). 

Thanks,
Eli


> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929740#action_12929740 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

bq. Any progress on getting the new patch based on liblzma ready?

Not that I'm aware of.  Nicholas has been concentrating entirely on his
studies, AFAIK, and I've been working on other MR issues.


> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881915#action_12881915 ] 

Allen Wittenauer commented on HADOOP-6837:
------------------------------------------

The 7z SDK license is "Public Domain" and 7z LZMA is LGPL.  Is that compatible with the APL?  

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881932#action_12881932 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

See the last line of their FAQ. ;-)

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882250#action_12882250 ] 

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

The Java code from the SDK hasn't been updated since version 4.61 (which is as of 23 November, 2008), so support for LZMA2 would need to rely on C code, or be ported to Java. 

The compression ratios of LZMA and LZMA2 are nearly identical (+/- .01% from the tests I did). It does look like LZMA2 is block based and is splittable, so that would be a major plus for it.

On the differences between LZMA and LZMA2:

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; LZMA2 is an extension on top of the original LZMA. LZMA2 uses
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; LZMA internally, but adds support for flushing the encoder,
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uncompressed chunks, eases stateful decoder implementations,
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; and improves support for multithreading.

http://tukaani.org/xz/xz-file-format.txt

I did have to add support for flushing the encoder to the Java code (flushing the encoder still produces valid lzma-compressed output).

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881928#action_12881928 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

7-Zip is LGPL; the LZMA SDK is not:

"License

LZMA SDK is placed in the public domain."

Given that both packages are hosted at the same site, with links to each other on the left bar, I think we can safely assume they know the difference between the two and have made a conscious decision to release them accordingly.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong Tang updated HADOOP-6837:
------------------------------

    Status: Open  (was: Patch Available)

Canceling the patch.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment:     (was: hadoop-6349-2.patch)

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896751#action_12896751 ] 

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

I did. The average number of overflow bytes is 24. I never saw it go above 120. A quick sed/dc script tells me the standard deviation is 18. So I'm fairly sure that I am correct in that it will never go above 273. Trying with setting the number of fast bytes to 273 gives average of 37 and standard deviation of 26.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896213#action_12896213 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

20100806 version looks pretty good.  The first couple of issues are the main ones:

 * FakeOutputStream LinkedList of 1 KB buffers (BLOCKSIZE = 1024):  supposed to be 128 KB buffers (package.html)
 * directory structure = unholy mix of 7zip, Hadoop  (expected directory structure similar to LZMA SDK tarball contents, e.g.,  lzma912/Java/SevenZip/Compression/LZMA/Decoder.java and lzma912/C/LzFind.c)
   ** OK to trim some of tarball's levels out (and to be different for Java, C), and top level need not be "lzma912" (though it makes connection to download more obvious)--I know I suggested "SevenZip", but I thought that appeared in both the Java and the C paths:  suggest lzma912 or lzma-9.12 instead
 * kDefaultDictionaryLogSize no longer changed to 16:  OK?
 * apparently bogus files:
   ** src/contrib/SevenZip/ivy/libraries.properties
   ** src/contrib/ec2/bin/hadoop-ec2-env.sh
 * "LZMA SDK" -> "LZMA SDK 9.12" in all boilerplate
 * CRC:  prefer to reuse existing (e.g., PureJavaCrc32); should be compatible
 * LzmaNativeInputStream.java:
   ** "circularwould"
   ** read():  fast busy-loop not thread-friendly...and not necessary:  read(b[]) (InputStream) blocks until at least 1 byte available--zero returned only if b[] has length zero, which is not true of oneByte[]
   ** read():  t unnecessary; just do:  return (int)oneByte[0] & 0xFF;
     or even:  return (ret == -1)? -1 : (int)oneByte[0] & 0xFF;
   ** 1<<13
 * LzmaNativeOutputStream.java:
   ** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf as above
   ** 1<<16
 * LzmaOutputStream.java:
   ** 1<<17
 * Makefile.am
   ** "All these are setup" -> "All these are set up"
   ** "are also done" -> "is also done"

(Per previous feedback, please change all "1<<N" to the Java equivalent of static const int kSomeBufferSize = M * 1024:  easier to read, easier to change later.)

Btw, it's always best to follow the existing style consistently than to use your own for your changes (with the possible exception of the boilerplate comment).  Perhaps Emacs hides the issue, but with 8-space tabs, your changes to the contrib LZMA files are a complete mismatch to their style.

Be sure to run "ant javadoc" (and fix any new issues) before the next patch, and give "ant test" a shot, too (over the weekend if you happen to see this--it takes several hours to run).  I'll work with you to get "ant test-patch" going.


> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881931#action_12881931 ] 

Eli Collins commented on HADOOP-6837:
-------------------------------------

If LZMA is public domain then it should safe to include. Would be good to have clarification from the author.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini reassigned HADOOP-6837:
----------------------------------------

    Assignee: Nicholas Carlini

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment: HADOOP-6837-lzma-3-20100809.patch

I'm going to respond to each one this time to make sure I don't miss anything again...

FakeOutputStream isn't the one I'm talking about in package.html. That's for the OutputStream/FakeInputStream. FakeOutputStream is just the one where I couldn't justify the maximum acting correctly (wrtiting a max of 273 bytes extra) so I added the linked list in case anything goes wrong.

Changed directory structure as suggested -- src/contrib/lzma912/src/SevenZip/* has the java files and src/contrib/lzma912/src/native has the native code.

The default dictionary size doesn't matter, because it gets set to something else on initialization. I left it there to keep the minimum set of changes.

I needed library.properties there for it to build correctly: I have no idea what it does, but if I delete it then it doesn't build. And when it's empty it doesn't complain, so that's why it's there.

Sorry about the ec2 file; I didn't realize that went into the patch! It's not supposed to be there.

Deleted CRC.java ... apparently it was never used.

Removed the read() while loop and wrote as suggested.

Fixed names (buffered/sendData) again, I had fixed them but then reverted to an older version when I introduced a bug.

Removed 1<<x's to y*1024.

Makefile.am, fixed grammar.

Yeah, sorry. I didn't realize that happened with the off-indentation. Fixed it.

I'll talk to you about ant test-patch. I'm running ant test right now and if anything goes wrong I'll push another patch. ant javadoc has no warnings.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment: HADOOP-6837-lzma-4-20100811.patch

Fixed two FindBugs warnings. Made one variable static and removed another.

Also I fixed build.xml (I think, it works correctly when doing 'ant package' on src/contrib, and it didn't work on the previous patch) to not crash when ant package is run. Had to override all of the build-contrib targets with empty ones. Otherwise it'd go looking for directories which didn't exist (because I don't need them to exist). 

I also changed build-contrib.xml (which is included by the other build.xml's) to have a compile-before target in it so there won't be any errors (even if I did set failonerror to false). Safer that way.

FindBugs fails because of four errors in the original SevenZip code. I could change those to static to make FindBugs happy but that would make more work to apply a patch to the java code, so unless there's a particularly good reason to do so I won't.
SIC Should SevenZip.Compression.LZMA.Decoder$LenDecoder be a _static_ inner class? 
SIC Should SevenZip.Compression.LZMA.Decoder$LiteralDecoder be a _static_ inner class? 
SIC Should SevenZip.Compression.LZMA.Encoder$LiteralEncoder be a _static_ inner class? 
SIC Should SevenZip.Compression.LZMA.Encoder$Optimal be a _static_ inner class?

And JavaDoc find an error:
[javadoc] javadoc: error - Illegal package name: ""
However, I can not reproduce this on my local machine. Let's see if it happens with Hudson.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897391#action_12897391 ] 

Hong Tang commented on HADOOP-6837:
-----------------------------------

Nicholas's has done some interesting works. But unfortunately I will -1 for marking it "patch available". The currently patch carries a modified version of LZMA SDK. This is a huge maintenance overhead going forward where a much simpler solution clearly exists. We should explore the liblzma route first as I mentioned in http://bit.ly/cDz2Pk.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890043#action_12890043 ] 

Hong Tang commented on HADOOP-6837:
-----------------------------------

@nicolas, per our offline conversation last week, have you looked into whether the licensing of liblzma is suitable for inclusion in Hadoop? Liblzma seems better in the sense that its API resembles closely the APIs of other compression libraries like bzip or zlib and should shrink the amount of coding work needed to support C (and Java over JNI).

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894773#action_12894773 ] 

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

Responding to the major comments -- will upload a patch that fixes these and the smaller comments soon.

FakeInputStream LinkedList:
This LinkedList can get fairly long, depending on how write is called. Worst case it can have upwards of 12 million elements, which is far beyond acceptable. This is the case if the write(single_byte) is called over and over. Each call will add a new link. Looking back at this, linked list probably wasn't the best way to go.

There are two (obvious) ways that write() could have worked. One is using linked lists as I did. The other way would be to create a byte array that can hold forceWriteLen bytes and just copy into it; however this can be as large as 12MB. There are then two other ways to make this work. The first is just allocating the 12MB right up front. The other way is to start it with maybe just 64k, and make it grow (by powers of two) until it reaches 12MB, however this would end up arraycopying a little under 12MB in total more than the other solution. I will implement one of these for the patch.


FakeOutputStream LinkedList:
This linked list has a more reasonable use. Its purpose is to hold extra bytes just in case the input stream gives too many. I am fairly confident that at most 272 bytes (maximum number of fast bytes - 1) can be written to it. The reason I used a linked list, however, is that I couldn't formally prove this after going through code. I wanted to be safe and just in case their code doesn't behave as it should, everything will work on the OutputStream end.


Code(..., len)
I think I remember figuring out that Code(...) will return at least (but possibly more than) len bytes with the one exception that when the end of the stream is reached it will only read up to the end of the stream. I will modify the decompressor to no longer assume this and use the actual number of bytes read instead.


Fixed the inStream.read() bug (and will be in patch I upload). Added a while loop to read until EOF is reached so the assumptions are true.


Tail call recursive methods -> while loop. Java should add tail-call optimizations when methods only call themselves recursively (which would require no changes to the bytecode).


Fixed memory leaks.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881935#action_12881935 ] 

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

Per the FAQ:

"You can also read about the LZMA SDK, which is available under a more liberal license."

http://www.7-zip.org/faq.html

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment: HADOOP-6837-lzma-c-20100719.patch

Uploaded C code with LzmaNativeInputStream and LzmaNativeOutputStream. Testing is the same as that for the Java code. The documentation is limited on the C side, and there are still (commented out) debug statements scattered all over.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment: HADOOP-6837-lzma-java-20100623.patch

Attached a patch. It includes an LzmaCodec, LzmaInputStream, and LzmaOutputStream. The LZMA compression/decompression uses the LZMA SDK from http://www.7-zip.org/sdk.html. The code has been tested minimally -- when used in io.SequenceFile.java, it passes the TestSetFile/TestArrayFile tests. I will attach another later on when I have fully-tested code. I will also be working on a native version of the code written in C, also based off of the LZMA SDK. It is significantly faster than than the Java code.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890051#action_12890051 ] 

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

I spoke with Greg about it just now and he said it would probably be better for me to work on FastLZ first, and come back to doing that latter.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment: HADOOP-6837-lzma-2-20100806.patch

I found a bug where Java compression would set a very, very wrong dictionary size. Instead of setting the dictionary size to, say, 2^24, it would set it to 24 (which would then be forced up to the min size, but still, very very wrong).

I also added a fairly long package.html (~3000 words) with documentation about how what I did works, so anyone else who wants to modify it hopefully won't need to spend forever exploring the code again figuring out how it works.

And also, I was both right and wrong about giveMore(). I was right when I first wrote it, and wrong when I said I fixed it by using the return value. The return values were actually left over from when I was checking for the end of stream in the Java end, but I realized that it was possible (because of the semi-circular buffer) for Java to indicate an EOF but for it not to be really true. So I had moved that check to the C code and just never removed that code from that Java end.

Fixed the linked list stuff.

Also a fairly significant directory restructure. The modified SDK code now is in src/contrib/SevenZip. Java code under src/java and C code is under src/native. I removed all of my re-formatting of their code so should a future version of the SDK be released, it shouldn't be as hard to do a diff and apply the patch to make this code better. The makefile from there builds to the same build tree as otherwise for java.

In order to get it building correctly, I had to modify the base build.xml and the contrib/build.xml. compile-core-classes now also depends on compile-contrib-before. compile-contrib-before now calls compile-before on contrib/build.xml. From there contrib/build.xml calls compile-before on contrib/*/build.xml, with failonerror set to false so this change will not break any other build scripts. (This change is required because Lzma{Input,Output}Stream stream requires the classes to be built first.)

I also cleaned up the code and fixed all the review comments.

There will be at least one more version of this patch for things I didn't catch.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892425#action_12892425 ] 

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

... that was supposed to go on HADOOP-6349, not here. Ignore that.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: hadoop-6349-2.patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Erik Forsberg (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929540#action_12929540 ] 

Erik Forsberg commented on HADOOP-6837:
---------------------------------------

Any progress on getting the new patch based on liblzma ready?

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897337#action_12897337 ] 

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

The current patch does not support LZMA2, and correct, it probably won't be included in this patch. There will probably be another JIRA filed to swap out the current C code with the LibLzma library to make the C side of things cleaner. LibLzma also has LZMA2 support, so that would be free. It should be trivial to add LibLzma which has a very Zlib-like interface. And then there will probably be another JIRA to port LibLzma to Java, and make everything nice and clean.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897512#action_12897512 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

bq. The currently patch carries a modified version of LZMA SDK. This is a huge maintenance overhead going forward where a much simpler solution clearly exists.

If the modifications are rolled into the SDK, this issue goes away.  Nicholas, can you create a current diff of the src/contrib part relative to lzma912's original path structure (i.e., so it applies cleanly to a stock lzma912 codebase)?  Then we can send it off to the 7Zip folks and see if they're willing to incorporate it.

(And only "partly exists."  There's no Java in liblzma.  On the other hand, consensus around here seems to be that built-in Java support isn't necessary.)

bq. It may be good to address the code style issue now, since this patch diverges significantly from our standard

Only the src/contrib portion does, and that was intentional.  bzip2 is no longer actively developed, so an in-tree, heavily modified port is no big deal.  LZMA, however, is still a very active project, and if we ever wanted to upgrade to a newer release (e.g., for performance or correctness reasons), we do _not_ want a lot of whitespace noise hiding the real diffs.

But this issue also largely disappears if the substantive modifications are accepted upstream; then the formatting is fairly irrelevant, though still a pain for diffs and patches.  Either way, I don't think style rules are or should necessarily be applicable to contrib code (in the outside-the-core-codebase sense of "contrib").

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-6837) Support for LZMA compression

Posted by "Nicholas Carlini (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------

    Attachment: hadoop-6349-2.patch

Attached an update patch.

Fixed the checksum mismatch. It was possible for the decompressor to run out of input after reading the header bytes but not notice if the block ID was 1. So if there were fewer than 26 bytes in the input (but more than 16) and the byte ID was 1 then it wouldn't notice and just use whatever happened to be in the buffer at the time.

Fixed a bug in the decompressor where it would incorrectly indicate it was finished if at the end of decompressing a block there was no more input left to decompress and decompress() was then called again (TestCodec seed 1333275328, 2011623221, -1402938700 or -1990280158; generate 50,000 records). Actually, the decompressor never returns finished now. This is because the only time the decompressor should return true is if it somehow knows the end of the stream has been reached and it doesn't, it just guesses that if it has read all the bytes it currently has then it's done, which is not the case.

Implemented getRemaining().

Removed iOff from both the compressor and decompressor. It was initialized to zero from the start and was only ever modified after that by setting it to 0.

Modified TestCodec to accept a seed as an argument.

Removed the rest of the carriage returns.


I will be adding a native version over the next few days and will upload that patch when it's done.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: hadoop-6349-2.patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883630#action_12883630 ] 

Scott Carey commented on HADOOP-6837:
-------------------------------------

What happens if you compress a tarball of those files instead?

Here are my results, on a directory with 4.1GB of ~64MB files.  The content is mixed binary/text (key/value data, binary keys, mixed binary/text values).

This is on CentOS 5.5, with the 'xz' and 'bzip2' packages installed via yum.

Compression / decompression speed.  Disk is capable of 200MB/sec read/write, 16GB RAM, Nehalem based processor (Xeon E5620, 2.4Ghz).
Tests confirmed to be CPU bound with no iowait.   measurements are in MB/sec for the uncompressed data.
Source tarball, 4130 MB  (100%)  
|| type | compressed size | compressed size as percent | time to compress | compression rate | time to decompress | decompression rate | 
|gzip -1|  1430MB | (34.6%)| 105 s| (39.3 MB/sec)| 42 s | 98.3 MB/sec  |
|gzip -6|  1240MB | (30.0%)| 251 s| (16.5 MB/sec)| 41.5s  | 99.5 MB/sec |
|bzip2 -2|  1003MB | (24.3%)| 656 s| (6.3 MB/sec)| 168 s | 24.6 MB/sec |
|bzip2 -6| 942MB | (22.8%)| 725 s| (5.7 MB/sec)| 176 s | 23.5 MB/sec |
|bzip2 -9| 926MB | (22.4%)| 763 s| (5.4 MB/sec)| 181 s | 22.8 MB/sec |
|xz -2|  993MB | (24.0%)| 429 s| (9.63 MB/sec)| 95s | 43.5 MB/sec |
|xz -6| 794MB | (19.2%)| 2861 s| (1.44 MB/sec)| 83s | 49.7 MB/sec |

Note that on today's newest processors, gzip decompresses at gigabit ethernet speeds.  xz is half that, and bzip2 about half that again.  Gzip ane zx decompress faster at higher compression ratios, bzip2 decompresses slower at higher ratios.  All compress slower the higher the ratio, but bzip2 only slows down by ~20% or so from the fast to slow settings, while gzip and xz slow down by a factor of 10+ (I did not do -9 tests here for those, they are very slow).

IMO, since xz-2 is almost 2x as fast at compression and decompression as bzip2, and similar in compression ratio, it leaves little room for bzip2's use.
At higher compression levels, xz is very slow to compress, but achieves compression ratios significantly better than anything else and still decompresses very fast, so its great for archival storage.

For faster compression, gzip -1 or lzo and other compression types without an entropy coder are the only options.

The link I provided above has several cases where xz is 3 or more times faster than bzip2 at decompression, but my data doesn't behave that way.


Raw Data:

$ time cat packed.tar | gzip -c1 > packed.gz1
real	1m44.938s
user	1m42.200s
sys	0m5.300s

$ time cat packed.tar | gzip -c6 > packed.gz6
real	4m11.051s
user	4m8.438s
sys	0m5.317s

$ time cat packed.tar | bzip2 -2 > packed.bz2-2
real	10m55.795s
user	10m52.989s
sys	0m5.030s

$ time cat packed.tar | bzip2 -6 > packed.bz2-6
real	12m4.847s
user	12m2.049s
sys	0m5.345s

$ time cat packed.tar | bzip2 -9 > packed.bz2-9
real	12m43.063s
user	12m40.353s
sys	0m4.797s


$ time cat packed.tar | xz -zv -2 - > packed.xz
  100.0 %             991.1 MiB / 4,125.0 MiB = 0.240   9.6 MiB/s         7:09
real	7m9.369s
user	7m6.985s
sys	0m7.140s

$ time cat packed.tar | xz -zv -6 - > packed.xz6
  100.0 %             792.6 MiB / 4,125.0 MiB = 0.192   1.4 MiB/s        47:41
real	47m41.033s
user	47m37.794s
sys	0m8.371s

------
Tests of decompression: 

$ time cat packed.gz1 | gunzip  > /dev/null
real	0m42.081s
user	0m41.814s
sys	0m1.361s

$ time cat packed.gz6 | gunzip  > /dev/null
real	0m41.512s
user	0m41.021s
sys	0m1.086s


$ time cat packed.bz2-2 | bunzip2  > /dev/null
real	2m48.528s
user	2m48.014s
sys	0m1.455s

$ time cat packed.bz2-6 | bunzip2  > /dev/null
real	2m56.511s
user	2m55.999s
sys	0m1.302s

$ time cat packed.bz2-9 | bunzip2  > /dev/null
real	3m1.064s
user	3m0.559s
sys	0m1.409s

$ time cat packed.xz | xz -dc  > /dev/null
real	1m35.239s
user	1m34.873s
sys	0m1.301s

$ time cat packed.xz6 | xz -dc  > /dev/null
real	1m23.219s
user	1m22.771s
sys	0m1.126s



> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882214#action_12882214 ] 

Scott Carey commented on HADOOP-6837:
-------------------------------------

Isn't there a new variant of LZMA (file extension xz) that uses LZMA2 and is block based (and therefore splittable)?  We should definitely make sure that is the variant we want to support.

LZMA is slower than gzip, but compresses better than both bzip2 and gzip.  It is also optimized for fast decompression -- it decompresses significantly faster than bzip2 (but not as fast as gzip).

This link is useful for understanding the performance / compression ratio differences across the various compression levels provided for each:

http://tukaani.org/lzma/benchmarks.html


LZO, FastLZ, LZF, and the like are all faster than the above three but compress at a lower ratio.  With LZMA support (hopefully .xz files, not the older 7zip) there is little reason to use bzip2 anymore -- lzma level 2 compresses as fast as bzip2 level 1, but has a compression ratio as high as bzip2 level 9.  lzma always decompresses 2 to 7 times as fast as bzip2 (only ~ half the decompression speed of gzip). 

It is the ideal archival storage format.   

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883265#action_12883265 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

Scott Carey wrote:

bq. lzma always decompresses 2 to 7 times as fast as bzip2 (only ~ half the decompression speed of gzip).

I didn't see that in my tests.  My measurements (last column) are in terms of compressed MB/sec, i.e., scaled by the compression ratio, but the ratios are close enough that that isn't a big factor:

{noformat}
bzip2-1: text = 78.9% (1.1),   1.464 (0.028) ucMB/sec,   1.189 (0.037) cMB/sec
         bin  = 50.1% (3.4),   1.395 (0.021) ucMB/sec,   2.170 (0.036) cMB/sec
bzip2-9: text = 80.5% (1.0),   1.415 (0.028) ucMB/sec,   1.135 (0.037) cMB/sec
         bin  = 51.6% (3.6),   1.340 (0.020) ucMB/sec,   1.878 (0.032) cMB/sec

xz-1:    text = 79.6% (1.0),   2.705 (0.097) ucMB/sec,   1.457 (0.049) cMB/sec
         bin  = 53.3% (3.5),   1.820 (0.031) ucMB/sec,   2.93  (0.20)  cMB/sec
xz-9:    text = 82.4% (0.8),   0.240 (0.011) ucMB/sec,   1.433 (0.051) cMB/sec
         bin  = 57.2% (3.6),   0.351 (0.010) ucMB/sec,   2.73  (0.17)  cMB/sec
{noformat}

So xz/LZMA is definitely faster to decompress, but not immensely so.  (This was all C code.  The "text" and "bin" measurements are averages across roughly 350 files of each type, various sizes.  Not a perfect corpus, but it should be varied enough to draw some reasonable conclusions.  On the other hand, the file sizes are definitely much smaller than is typical in Hadoop jobs.)

Btw, I didn't see Nicholas mention it, but all of the LZMA variants he tested appear to be stream-compatible--that is, any of the tools can decompress any of the others' streams, possibly modulo some header-parsing.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Pelle Nilsson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897277#action_12897277 ] 

Pelle Nilsson commented on HADOOP-6837:
---------------------------------------

Do I read these comments correctly that LZMA2/xz is not included in the current patch, and might not be included as part of this issue since the LZMA Java lib does not support it?

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894246#action_12894246 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

First, apologies for having steered Nicholas wrong on the liblzma issue.  As
Hong noted, it provides a much saner (that is, zlib-like) API for this sort
of thing, but I mistakenly thought it shared the GPL license of (parts of)
xz, so we ignored it and he worked on the LZMA SDK code instead.  (The latter
did include a Java port, however; liblzma does not.)

Overall, the 20100722 patch looks pretty decent (given the starting material),
but it does include some less-than-beautiful workarounds to cope with the
impedance mismatch between push- and pull-style I/O models.  In light of the
fact that liblzma is, in fact, public-domain software (every file under
xz-4.999.9beta-143-g3e49/src/liblzma is either explicitly in the public
domain or has been automatically generated by such a file), I'm going to ask
that Nicholas redo the native-code version to use liblzma rather than the
SDK.  (Unfortunately, it looks like the transformation from C SDK to liblzma
was a significant amount of work, so it doesn't appear that a trivial "liblzma-
ification" of the Java SDK code is likely.  If Nicholas concurs with that
assessment, we can instead file a separate JIRA to port the liblzma C code
to Java.)  Note that liblzma includes an LZMA2 codec, so Scott Carey's
splittable-codec suggestion is within reach, too.

OK, enough preamble.  There were a number of relatively minor style issues,
which I'll simply list below, but the five main concerns were:

 - FakeInputStream.java, FakeOutputStream.java:  the linked lists of byte
   arrays are tough to swallow, even given the push/pull problem, even given
   our previous discussions on the matter.  It would be good to know what the
   stats are on these things in "typical" cases--how frequently does overflow
   occur in LzmaInputStream, for example, and how many buffers are used?

 - Is the Code(..., len) call in LzmaInputStream guaranteed to produce len
   bytes if it returns true?  The calling read() function assumes this, but
   it's not at all obvious to me; the parameter is outSize in Code(), and I
   don't see that it's decremented or even really used at all (other than
   being stored in oldOutSize), unless it's buried inside macros defined
   elsewhere.

The next two (or perhaps three) are no longer directly relevant, but they're
general things to watch out for:

 - The return value from inStream.read() in LzmaNativeInputStream.java is
   ignored even though there's no guarantee the call will return the requested
   number of bytes.  A comment ("never have to call ... again") reiterates
   this error.

 - There's no need for recursion in LzmaNativeOutputStream.java's write()
   method; iterative solutions tend to be far cleaner, I think.  (Probably
   ditto for LzmaNativeInputStream.java's read() method.)

 - LzmaCompressor.c has a pair of memleaks (state->outStream, state->inStream).

Here are the minor readability/maintainability/cosmetic/style issues:

 * add LZMA SDK version (apparently 9.12) and perhaps its release date to the
   boilerplate
 * tabs/formatting of LZMA SDK code (CRC.java, BinTree, etc.):  I _think_
   tabs are frowned upon in Hadoop, though I might be wrong; at any rate,
   they seem to be rarely used
   ** for easy Hadoop-style formatting, "indent -i2 -br -l80" is a start (though
     it's sometimes confused by Java/C++ constructs)
 * reuse existing CRC implementation(s)?  (JDK has one, Hadoop has another)
 * prefer lowercase "lzma" for subdirs
 * use uppercase "LZMA" when referring to codec/algorithm (e.g., comments)
 * add README mentioning anything special/weird/etc. (e.g., weird hashtable
   issue); summary of changes made for Hadoop; potential Java/C diffs; binary
   compatibility between various output formats (other than trivial headers/
   footers); LZMA2 (splittable, not yet implemented); liblzma (much cleaner,
   more zlib-like implementation, still PD); etc.
 * "ant javadoc" run yet?  (apparently not recently)
 * line lengths, particularly comments (LzmaInputStream.java, etc.):  should
   be no more than 80 columns in general (Hadoop guidelines)
 * avoid generic variable names for globals and class members; use existing
   conventions where possible (e.g., look at gzip/zlib and bzip2 code)
 * LzmaCodec.java:
   ** uppercase "LZMA" when referring to codec/algorithm in general
   ** "funcionality" x 4
   ** "throws ... {" continuation line:  don't further indent
 * LZ/InWindow.java
   ** leftover debug code at end
 * RangeCoder/Encoder.java
   ** spurious blank line (else just boilerplate)
 * FakeOutputStream.java:
   ** "stuffeed"
   ** "ammount"
   ** isOverflow() -> didOverflow()
 * LzmaInputStream.java:
   ** [uses FakeOutputStream]
   ** "bufferd"
   ** "we 've"
   ** index -> overflowIndex (or similar):  too generic
   ** hasRead -> initialized
   ** doInit():  use LzmaCodec getDefaultExtension() (or simply change text to
     "LZMA") instead of hardcoded ".lzma"
     *** if higher-level code extended to support .xz, would this code even know?
   ** read():  do doInit() after args-checking/throws
   ** 3rd read() has bogus @return comment
   ** overflow approach:  data on how frequent, how many buffers?
     *** no reuse:  creating new FakeOutputStream on every read()
 * LzmaOutputStream.java
   ** [uses FakeInputStream]
   ** hasWritten -> initialized
   ** 6 -> static final int defaultCompressionLevel
   ** ctor:  level -> this.level; l -> level (broken javadoc currently)
   ** "may cause a lower compression ratio":  measured?  by how much?
 * LzmaNativeInputStream.java
   ** line lengths
   ** hasRead -> initialized
   ** tmp -> tmpBuf
   ** max -> maxBlockSize?  ...?
   ** index -> ...?
   ** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
   ** what -> something better
   ** 1<<16, 1<<12 -> static final whatever ints, and use N*1024 instead
   ** allocateDirect() should be using config'd buffer size, I think (check
     zlib/similar impls for use of io.file.buffer.size)
   ** giveMore():  return val not used, but if it were, true/false seems backward
     (expect:  if !giveMore() => nothing available)
   ** read():  point of extras just to support recursion?  (extras, what: what??)
   ** "decompresed"
 * LzmaNativeOutputStream.java
   ** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
   ** 1<<16, 1<<17 -> static final whatever int (and 64*1024, etc.)
   ** hasWritten -> initialized
   ** tmp -> tmpBuf
 * LZMA/Encoder.java
   ** "their" -> "original SDK" or similar
   ** "Setup the encoder" -> "Set up the encoder"
   ** "Do one encoding of a block" -> "Encode one block" ?
 * LzmaCompressor.c
   ** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
 * LzmaDecompressor.c
   ** inPos vs. posInInBuf:  ???
   ** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
   ** 1<<16, 1<<12, difference -> macros or static const ints
 * Makefile.am
   ** zlib -> LZMA
   ** "All these are setup" -> "All these are set up"
   ** "are also done" -> "is also done"
 * LzFind.c
   ** remove commented-out stuff
   ** (1<<15)+(1<<14) -> 48*1024 (etc.)
 * LZMA/LzmaEnc.c
   ** tabs


> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881934#action_12881934 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

LZMA is not faster than gzip/bzip2; it compresses better.  FastLZ (next item on Nicholas's plate) is faster than LZO but compresses more poorly than everything else (except maybe LZW).  They're both useful, but they address different parts of the problem domain.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6837) Support for LZMA compression

Posted by "Greg Roelofs (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896739#action_12896739 ] 

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

bq. FakeOutputStream isn't the one I'm talking about in package.html. That's for the OutputStream/ FakeInputStream. FakeOutputStream is just the one where I couldn't justify the maximum acting correctly (wrtiting a max of 273 bytes extra) so I added the linked list in case anything goes wrong.

Argh, yes, of course...we discussed that at least twice already.  Sorry for spacing.

Did you ever instrument it to emit a warning if it did go above 273 extra bytes?

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.