Posted to dev@nutch.apache.org by "Giuseppe Totaro (JIRA)" <ji...@apache.org> on 2015/03/11 18:20:38 UTC

[jira] [Created] (NUTCH-1961) Provide multipart compression of Common Crawl data

Giuseppe Totaro created NUTCH-1961:
--------------------------------------

             Summary: Provide multipart compression of Common Crawl data
                 Key: NUTCH-1961
                 URL: https://issues.apache.org/jira/browse/NUTCH-1961
             Project: Nutch
          Issue Type: Wish
    Affects Versions: 1.9
            Reporter: Giuseppe Totaro
            Priority: Minor


Using the {{-gzip}} option of {{CommonCrawlDataDumper}}, users can compress data and create a TAR archive (using [Apache Commons Compress|http://commons.apache.org/proper/commons-compress]).
We could also offer the option of producing a multipart compressed archive, starting a new part whenever a size threshold is reached. I ran some tests with a {{CountingOutputStream}} placed "in the middle" to count the bytes written, but this requires flushing the output streams at each iteration.
Furthermore, _gzip_ does not support multipart compression (we can split the archive into multiple {{.tar.gz}} files, but they have to be decompressed individually), whereas _zip_ does (even though this feature is not yet supported in Apache Commons Compress).
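
To make the counting idea concrete, here is a minimal sketch that uses only the JDK. It is not the code of {{CommonCrawlDataDumper}}: the class {{MultipartGzipSketch}}, the method {{writeParts}} and the nested {{CountingStream}} (a stand-in for a {{CountingOutputStream}}) are hypothetical names, the TAR layer is left out, and the parts are kept in memory instead of being written to files. The counter sits between the gzip stream and the sink, so it sees compressed bytes only after a flush, which is why a flush is needed at each iteration.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: rotate to a new gzip part once a byte threshold is reached.
class MultipartGzipSketch {

    /** Counts the (compressed) bytes that actually reach the underlying stream. */
    static class CountingStream extends FilterOutputStream {
        long count = 0;

        CountingStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }
    }

    /** Compresses the records into gzip parts of roughly thresholdBytes each. */
    static List<byte[]> writeParts(List<byte[]> records, long thresholdBytes)
            throws IOException {
        List<byte[]> parts = new ArrayList<>();
        ByteArrayOutputStream sink = null;
        CountingStream counter = null;
        GZIPOutputStream gzip = null;

        for (byte[] record : records) {
            if (gzip == null) {
                // Start a new part: sink <- counter <- gzip.
                sink = new ByteArrayOutputStream();
                counter = new CountingStream(sink);
                gzip = new GZIPOutputStream(counter, true); // syncFlush enabled
            }
            gzip.write(record);
            // Without this flush the deflater may still buffer data,
            // and the counter would under-report the part size.
            gzip.flush();

            if (counter.count >= thresholdBytes) {
                gzip.close(); // writes the gzip trailer for this part
                parts.add(sink.toByteArray());
                gzip = null;
            }
        }
        if (gzip != null) {
            // Close the last, possibly smaller, part.
            gzip.close();
            parts.add(sink.toByteArray());
        }
        return parts;
    }
}
```

Each part produced this way is a complete gzip stream, so the parts have to be decompressed one by one. Flushing after every record also costs some compression ratio, so a real implementation might prefer to flush less often and accept parts that overshoot the threshold slightly.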
I would really appreciate your feedback/ideas about this.
Thanks a lot,
Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)