Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2015/03/12 19:14:38 UTC

[jira] [Created] (NUTCH-1963) CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked

Lewis John McGibbney created NUTCH-1963:
-------------------------------------------

             Summary: CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
                 Key: NUTCH-1963
                 URL: https://issues.apache.org/jira/browse/NUTCH-1963
             Project: Nutch
          Issue Type: Bug
          Components: commoncrawl
    Affects Versions: 1.10
            Reporter: Lewis John McGibbney
             Fix For: 1.10


When invoking the commoncrawldump tool with the *-gzip* option and *-mimetype application/pdf*, I get the following stack trace, which causes the task to fail:

{code}
java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
	at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
	at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
{code}

The workaround is to omit the *-gzip* option and defer compression to a later task. That avoids the failure but does not solve the problem.
We need to fix this so that the tool works as designed.
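
To illustrate why the entry in the stack trace is rejected, here is a minimal sketch that measures the encoded length of the file name against the 100-byte name field of a classic tar header. The class TarNameCheck and its methods are hypothetical helpers, not part of Nutch or Commons Compress. The UTF-8 encoding and the exact boundary condition are assumptions taken from the "( > 100 bytes)" wording of the error message, and the library's own check may differ slightly. Commons Compress does expose TarArchiveOutputStream.setLongFileMode with constants such as LONGFILE_GNU and LONGFILE_POSIX, which may be one way to address this, though that is untested here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch or Commons Compress.
final class TarNameCheck {

    // Size of the name field in a classic tar header, in bytes.
    static final int TAR_NAME_FIELD_BYTES = 100;

    // Length of the name once encoded (UTF-8 is an assumption here).
    static int encodedLength(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    // Simplified check mirroring the "> 100 bytes" wording of the error message.
    static boolean fitsInTarHeader(String name) {
        return encodedLength(name) <= TAR_NAME_FIELD_BYTES;
    }

    public static void main(String[] args) {
        // File name copied from the stack trace above.
        String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(encodedLength(name) + " bytes, fits in header: "
                + fitsInTarHeader(name));
    }
}
```

A check like this could be used to log or skip problem entries before calling putArchiveEntry, but it only detects the condition; it does not make long names archivable.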




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)