Posted to dev@nutch.apache.org by "Yossi Tamari (JIRA)" <ji...@apache.org> on 2017/11/20 14:35:00 UTC
[jira] [Created] (NUTCH-2463) Enable sampling CrawlDB
Yossi Tamari created NUTCH-2463:
-----------------------------------
Summary: Enable sampling CrawlDB
Key: NUTCH-2463
URL: https://issues.apache.org/jira/browse/NUTCH-2463
Project: Nutch
Issue Type: Improvement
Components: crawldb
Reporter: Yossi Tamari
Priority: Minor
CrawlDB can grow to contain billions of records. At that size, *readdb -dump* is of little practical use, and *readdb -topN* can take a very long time to run (and does not provide a statistically valid sample).
We should add a *-sample* parameter to *readdb -dump*, followed by a number between 0 and 1, so that only that fraction of the records in the CrawlDB is processed.
The sample should be statistically random, and all the other filters should be applied to the sampled records.
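The issue does not say how the sampling should be implemented. As a minimal sketch of one possible reading, each record could be kept independently with probability equal to the given fraction (Bernoulli sampling), with the other filters applied only to records that survive sampling. The class name SampleSketch, the method sampleThenFilter, and the generic record type below are hypothetical and are not part of the Nutch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Nutch.
public class SampleSketch {

    // Keeps each record independently with probability `fraction`,
    // then applies `otherFilters` only to the records that were sampled.
    public static <T> List<T> sampleThenFilter(Iterable<T> records, double fraction,
                                               Predicate<T> otherFilters, Random rng) {
        // Written this way so that NaN is rejected as well.
        if (!(fraction >= 0.0 && fraction <= 1.0)) {
            throw new IllegalArgumentException("fraction must be between 0 and 1");
        }
        List<T> out = new ArrayList<>();
        for (T record : records) {
            // nextDouble() is uniform on [0, 1), so 0 keeps nothing and 1 keeps everything.
            if (rng.nextDouble() >= fraction) {
                continue; // record not sampled; other filters are never evaluated for it
            }
            if (otherFilters.test(record)) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in data: integers in place of CrawlDB records (an assumption for the demo).
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            records.add(i);
        }
        // Sample about 1% of the records, then keep only the even ones.
        List<Integer> result = sampleThenFilter(records, 0.01, r -> r % 2 == 0, new Random());
        System.out.println("records kept: " + result.size());
    }
}
```

With this approach the number of sampled records is only approximately fraction times the total, not exact, but no prior record count is needed and each record is decided in a single pass.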
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)