Posted to common-issues@hadoop.apache.org by "Tatu Saloranta (JIRA)" <ji...@apache.org> on 2009/11/24 20:08:39 UTC

[jira] Created: (HADOOP-6389) Add support for LZF compression

Add support for LZF compression
-------------------------------

                 Key: HADOOP-6389
                 URL: https://issues.apache.org/jira/browse/HADOOP-6389
             Project: Hadoop Common
          Issue Type: New Feature
          Components: io
            Reporter: Tatu Saloranta


(note: related to [HADOOP-4874])

As per Doug's earlier comments, LZF does indeed look like a good compressor candidate: fast compression/decompression with a good-enough compression ratio.
From my testing it seems at least twice as fast at compression, and somewhat faster at decompression, than gzip.
Code from [http://h2database.googlecode.com/svn/trunk/h2/src/main/org/h2/compress/] is applicable, and I have tested it with JSON data.
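
(Illustrative only: a minimal timing harness along these lines reproduces the comparison. java.util.zip is the JDK's; the CompressLZF class name and its compress(in, inLen, out, outPos) signature are assumed from the H2 package linked above, so verify them against the actual source before relying on this.)

    import java.io.ByteArrayOutputStream;
    import java.util.zip.GZIPOutputStream;
    import org.h2.compress.CompressLZF;

    public class CompressBench {
        public static void main(String[] args) throws Exception {
            // Moderately repetitive input, standing in for JSON data.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 20000; i++) {
                sb.append("{\"id\":").append(i).append(",\"name\":\"user\"}");
            }
            byte[] input = sb.toString().getBytes("UTF-8");

            // gzip: stream-based, from the JDK.
            long t0 = System.nanoTime();
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(bos);
            gz.write(input);
            gz.close();
            long gzipNanos = System.nanoTime() - t0;

            // LZF: block-based, assumed H2 API (in, inLen, out, outPos).
            CompressLZF lzf = new CompressLZF();
            byte[] out = new byte[input.length * 2]; // worst-case room
            t0 = System.nanoTime();
            int lzfLen = lzf.compress(input, input.length, out, 0);
            long lzfNanos = System.nanoTime() - t0;

            System.out.println("gzip: " + bos.size() + " bytes, "
                    + (gzipNanos / 1000) + " us");
            System.out.println("lzf:  " + lzfLen + " bytes, "
                    + (lzfNanos / 1000) + " us");
        }
    }

(A single-shot timing like this is noisy; a fair comparison should warm up the JIT and average many runs.)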

I hope to have more time to spend on this in the near future, but if someone else gets to it first, that'd be good too.





[jira] Commented: (HADOOP-6389) Add support for LZF compression

Posted by "Tatu Saloranta (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786353#action_12786353 ] 

Tatu Saloranta commented on HADOOP-6389:
----------------------------------------

Hmmh. Looking at the hadoop-common compress package, I realize that Hadoop compressors are rather complicated beasts... it's a bit like reading the blueprint of a lunar module or something. :-)
At least compared to the relative simplicity of the LZF codec to be wrapped within the framework.
So I could use some help figuring out the best way to properly embed LZF in there, including the ability to support splitting.
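
For reference, here is roughly the shape such a codec has to fill in. The CompressionCodec interface is Hadoop's own (org.apache.hadoop.io.compress); LzfCodec and the four Lzf* helper classes are made-up names for this sketch, not existing code.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionInputStream;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.Compressor;
    import org.apache.hadoop.io.compress.Decompressor;

    // Hypothetical shell: LzfCompressor, LzfDecompressor and the two
    // stream classes would wrap the block codec and do the real work.
    public class LzfCodec implements CompressionCodec {
        public CompressionOutputStream createOutputStream(OutputStream out)
                throws IOException {
            return new LzfOutputStream(out);
        }
        public CompressionOutputStream createOutputStream(OutputStream out,
                Compressor compressor) throws IOException {
            return new LzfOutputStream(out); // LZF keeps no native state
        }
        public Class<? extends Compressor> getCompressorType() {
            return LzfCompressor.class;
        }
        public Compressor createCompressor() {
            return new LzfCompressor();
        }
        public CompressionInputStream createInputStream(InputStream in)
                throws IOException {
            return new LzfInputStream(in);
        }
        public CompressionInputStream createInputStream(InputStream in,
                Decompressor decompressor) throws IOException {
            return new LzfInputStream(in);
        }
        public Class<? extends Decompressor> getDecompressorType() {
            return LzfDecompressor.class;
        }
        public Decompressor createDecompressor() {
            return new LzfDecompressor();
        }
        public String getDefaultExtension() {
            return ".lzf";
        }
    }

The shell itself is the easy part; the buffering and the Compressor/Decompressor state machine (the complicated bit mentioned above) would live in the helper classes.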





[jira] Commented: (HADOOP-6389) Add support for LZF compression

Posted by "Tatu Saloranta (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785784#action_12785784 ] 

Tatu Saloranta commented on HADOOP-6389:
----------------------------------------

Ok: I am now working with the Voldemort team to get a good LZF codec adaptation (we need byte[]->byte[] compression, with no need for streams in this case; we also prefer using the standard LZF framing so that the C version is compatible), and the code is available at [http://github.com/ijuma/h2-lzf].

I can now have a look at what interface Hadoop uses for codecs, to see what would be the best way to get the same or modified code hooked up.

Also: one interesting thing about LZF is that its framing is not only very simple, but probably also nice for splitting/merging larger files. There is no separate per-file header; a file is just a sequence of chunks with minimalistic headers. This means that you can append chunks by simple concatenation, split them off in the reverse direction, or even shuffle them if need be. And skipping through chunks can be done using the headers alone, without decompressing the actual contents. That sounds quite nice for Hadoop's use case in general... but I don't know how much support is needed from the codec to let the framework make good use of this.
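
To make the chunk structure concrete, here is a sketch that walks an LZF file chunk by chunk without decompressing anything. It assumes the standard LZF framing described above: each chunk starts with the signature bytes 'Z' 'V' and a type byte, where type 0 is an uncompressed chunk (2-byte length follows) and type 1 is a compressed chunk (2-byte compressed length, then 2-byte uncompressed length), lengths big-endian. Verify the details against the h2-lzf source before using.

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class LzfChunkWalker {
        public static void main(String[] args) throws IOException {
            DataInputStream in =
                    new DataInputStream(new FileInputStream(args[0]));
            long offset = 0;
            int first;
            while ((first = in.read()) >= 0) { // -1 means clean end of file
                if (first != 'Z' || in.read() != 'V') {
                    throw new IOException("bad chunk signature at " + offset);
                }
                int type = in.read();
                int compLen = in.readUnsignedShort();
                int origLen = (type == 1) ? in.readUnsignedShort() : compLen;
                System.out.println("chunk at " + offset + ": " + compLen
                        + " bytes stored, " + origLen + " uncompressed");
                in.skipBytes(compLen); // payload is never decoded;
                                       // a robust version would loop here
                offset += ((type == 1) ? 7 : 5) + compLen;
            }
            in.close();
        }
    }

Concatenating two such files yields another valid LZF file, which is exactly the property that makes the splitting/merging above attractive.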






[jira] Commented: (HADOOP-6389) Add support for LZF compression

Posted by "Tatu Saloranta (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893076#action_12893076 ] 

Tatu Saloranta commented on HADOOP-6389:
----------------------------------------

Although I have not worked on the integration, I have been able to get a simple reusable LZF block codec published; it is available from github (http://github.com/ning/compress) and the main Maven repo (group com.ning, artifact compress-lzf). So at least the simple part (the codec itself) is ready for anyone familiar enough with Hadoop to handle the full integration, ideally supporting access at least at the block level (reads can start from block boundaries; blocks are byte-aligned and contain both compressed and uncompressed block lengths, to support reasonably efficient skipping of blocks).
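
A round trip with that artifact is a one-liner each way. The static LZFEncoder.encode / LZFDecoder.decode methods shown here are taken from the published compress-lzf API, but check the README at the github link above for the version you actually pull in.

    import java.util.Arrays;
    import com.ning.compress.lzf.LZFDecoder;
    import com.ning.compress.lzf.LZFEncoder;

    public class LzfRoundTrip {
        public static void main(String[] args) throws Exception {
            byte[] original =
                    "sample payload, ideally something repetitive"
                            .getBytes("UTF-8");
            byte[] compressed = LZFEncoder.encode(original); // adds chunk framing
            byte[] restored = LZFDecoder.decode(compressed); // walks all chunks
            System.out.println("round-trip ok: "
                    + Arrays.equals(original, restored));
        }
    }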


