Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2015/04/01 19:08:18 UTC
[Nutch Wiki] Update of "CommonCrawlDataDumper" by darrencheng
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "CommonCrawlDataDumper" page has been changed by darrencheng:
https://wiki.apache.org/nutch/CommonCrawlDataDumper?action=diff&rev1=2&rev2=3
bin/nutch commoncrawldump -outputDir outCommonCrawl -segment testCrawl/segments
}}}
+ If you later get an {{{OutOfMemoryError}}} when running the script, try increasing the {{{JAVA_HEAP_MAX}}} variable on line 128 of {{{bin/nutch}}} to an appropriate value.
+
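As a rough sketch of the heap change described above: this assumes {{{JAVA_HEAP_MAX}}} holds a standard JVM {{{-Xmx}}} flag. The {{{heap_flag}}} helper and the 4000 MB value are illustrative only and are not taken from {{{bin/nutch}}}; check the actual line in your copy of the script before editing it.

```shell
# Hypothetical helper (not part of bin/nutch): build a JVM max-heap flag
# from a size given in megabytes.
heap_flag() {
  printf '%s' "-Xmx${1}m"
}

# Assumed example value of 4000 MB; choose a size that fits your machine.
JAVA_HEAP_MAX="$(heap_flag 4000)"
echo "JAVA_HEAP_MAX=$JAVA_HEAP_MAX"
```

The resulting string is the kind of value one would assign to {{{JAVA_HEAP_MAX}}} in the script; a larger number gives the JVM more room at the cost of system memory.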
The {{{bin/nutch commoncrawldump}}} program dumps all Nutch segments in {{{testCrawl/segments}}} to the {{{outCommonCrawl}}} folder, creating one CBOR-encoded file for each crawled file. The tool then shows a short report as follows:
{{{