Posted to dev@nutch.apache.org by "Giuseppe Totaro (JIRA)" <ji...@apache.org> on 2015/03/11 18:20:38 UTC
[jira] [Created] (NUTCH-1961) Provide multipart compression of Common Crawl data
Giuseppe Totaro created NUTCH-1961:
--------------------------------------
Summary: Provide multipart compression of Common Crawl data
Key: NUTCH-1961
URL: https://issues.apache.org/jira/browse/NUTCH-1961
Project: Nutch
Issue Type: Wish
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Priority: Minor
Using the {{-gzip}} option in {{CommonCrawlDataDumper}}, users can compress data and create a TAR archive (using [Apache Commons Compress|http://commons.apache.org/proper/commons-compress]).
We could also offer the option to create a multipart compressed archive, starting a new part whenever a size threshold is reached. I did some tests using a {{CountingOutputStream}} "in the middle" to count the bytes written, but this requires flushing the output streams at each iteration.
Furthermore, _gzip_ does not support multipart compression (we can split the archive into multiple {{.tar.gz}} files, but each one has to be decompressed individually), whereas _zip_ does (although Apache Commons Compress does not support this feature yet).
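A minimal sketch of the threshold idea, for illustration only and not the code used in the tests mentioned above: a counting stream sits between the gzip stream and the sink, and a new part is started once the compressed byte count reaches the threshold. It uses only the JDK ({{java.util.zip.GZIPOutputStream}}) and omits the TAR layer. {{MultipartGzipSketch}}, {{CountingStream}} and {{compressInParts}} are hypothetical names, and the parts are kept in memory rather than written to files.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class MultipartGzipSketch {

    /** Hypothetical stand-in for a CountingOutputStream: counts the bytes passed downstream. */
    static final class CountingStream extends FilterOutputStream {
        private long count;

        CountingStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }

        long getCount() {
            return count;
        }
    }

    /**
     * Compresses the records into independent gzip parts. A part is closed
     * once its compressed size reaches thresholdBytes (assumed policy).
     */
    static List<byte[]> compressInParts(List<byte[]> records, long thresholdBytes)
            throws IOException {
        List<byte[]> parts = new ArrayList<>();
        ByteArrayOutputStream sink = null;
        CountingStream counter = null;
        GZIPOutputStream gzip = null;

        for (byte[] record : records) {
            if (gzip == null) {
                // Start a new part: gzip -> counter -> sink.
                sink = new ByteArrayOutputStream();
                counter = new CountingStream(sink);
                // syncFlush = true, so flush() pushes pending compressed data downstream.
                gzip = new GZIPOutputStream(counter, true);
            }
            gzip.write(record);
            // Without this flush the counter would lag behind the data written so far.
            gzip.flush();
            if (counter.getCount() >= thresholdBytes) {
                gzip.close(); // writes the gzip trailer, completing this part
                parts.add(sink.toByteArray());
                gzip = null;
            }
        }
        if (gzip != null) {
            gzip.close();
            parts.add(sink.toByteArray());
        }
        return parts;
    }
}
```

Each returned part is a complete gzip stream, so it can be decompressed on its own. Because the count is checked only after a record has been written, a part may exceed the threshold by up to one compressed record, and flushing after every record usually costs some compression ratio.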
I would really appreciate your feedback/ideas about this.
Thanks a lot,
Giuseppe
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)