Posted to common-issues@hadoop.apache.org by "Binglin Chang (JIRA)" <ji...@apache.org> on 2011/01/16 18:04:52 UTC

[jira] Commented: (HADOOP-5793) High speed compression algorithm like BMDiff

    [ https://issues.apache.org/jira/browse/HADOOP-5793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982338#action_12982338 ] 

Binglin Chang commented on HADOOP-5793:
---------------------------------------

Luke: I read the paper "Data compression using long common strings", which describes BMDiff. It seems that the main advantage of BMDiff is that it can find long common strings across the entire file (not only within the sliding window used by dictionary-based algorithms). But Hadoop uses a streaming compression framework, which sends one block (buffer) at a time to the compressor/decompressor. That prevents BMDiff from finding repeated strings across the entire file, and may lead to poor compression results. Are there any test results showing the relationship between pack (buffer) size, compression speed, and compression ratio?
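
As a rough illustration of the concern about block-at-a-time compression, here is a minimal Python sketch. It is not BMDiff and does not use Hadoop's codec API: it assumes the standard-library lzma module as a stand-in for a compressor that can match repeats far apart, and the helper names (make_sample, compressed_size_whole, compressed_size_blocked) are hypothetical. The sample is one incompressible chunk repeated several times, so the only redundancy is long-range.

```python
import lzma
import random


def make_sample(chunk_size=256 * 1024, repeats=4, seed=0):
    """Build data whose only redundancy is long-range: one random
    (incompressible) chunk repeated `repeats` times."""
    rng = random.Random(seed)
    chunk = rng.getrandbits(8 * chunk_size).to_bytes(chunk_size, "big")
    return chunk * repeats


def compressed_size_whole(data):
    """Compress the whole input in one call, so the compressor can
    match against everything it has already seen."""
    return len(lzma.compress(data))


def compressed_size_blocked(data, block_size):
    """Compress each block independently, the way a buffer-at-a-time
    framework would, and sum the compressed sizes."""
    total = 0
    for start in range(0, len(data), block_size):
        total += len(lzma.compress(data[start:start + block_size]))
    return total


if __name__ == "__main__":
    data = make_sample()
    print("input bytes:", len(data))
    print("whole stream:", compressed_size_whole(data))
    for block_size in (64 * 1024, 256 * 1024, 512 * 1024):
        print("block size", block_size, "->",
              compressed_size_blocked(data, block_size))
```

With blocks no larger than the repeated chunk, each block looks like fresh random data and the repeats are never found. Once a block spans more than one copy of the chunk, the blocked total starts to approach the whole-stream size. Real data and a real BMDiff implementation may behave quite differently, so this only shows the shape of the buffer-size versus ratio question, not an answer to it.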

> High speed compression algorithm like BMDiff
> --------------------------------------------
>
>                 Key: HADOOP-5793
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5793
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: elhoim gibor
>            Assignee: Michele Catasta
>            Priority: Minor
>
> Add a high speed compression algorithm like BMDiff.
> It gives speeds of ~100 MB/s for writes and ~1000 MB/s for reads, compressing 2.1 billion web pages from 45.1 TB to 4.2 TB.
> Reference:
> http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=437
> 2005 Jeff Dean talk about Google architecture - around 46:00.
> http://feedblog.org/2008/10/12/google-bigtable-compression-zippy-and-bmdiff/
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=755678
> A reference implementation exists in HyperTable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.