Posted to dev@nutch.apache.org by "Nguyen Manh Tien (JIRA)" <ji...@apache.org> on 2013/11/25 03:44:35 UTC

[jira] [Created] (NUTCH-1674) Use batchId filter enable scan (GORA-119) for Fetch,Parse,Update,Index

Nguyen Manh Tien created NUTCH-1674:
---------------------------------------

             Summary: Use batchId filter enable scan (GORA-119) for Fetch,Parse,Update,Index
                 Key: NUTCH-1674
                 URL: https://issues.apache.org/jira/browse/NUTCH-1674
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 2.3
            Reporter: Nguyen Manh Tien
         Attachments: NUTCH-1674.patch

Nutch always scans the whole crawldb in each phase (generate, fetch, parse, update, index). When the crawldb is big, the scan takes longer than the actual processing.
We really need to skip records while scanning, using GORA-119; for example, we can retrieve only the records that belong to a specified batchId.
In my crawl, the filter reduced the scan time from 90 min to 30 min.
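
To illustrate the idea, here is a minimal, self-contained Java sketch. It does not use the Gora or Nutch APIs and is not the attached patch: Page, Store, rowsDelivered and the two select methods are hypothetical names made up for this example. It only shows the difference between handing every row to the job and testing batchId in the job, versus passing a batchId filter to the scan so that non-matching rows are never delivered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class BatchIdScanSketch {

    /** Hypothetical stand-in for a crawldb row (not Nutch's WebPage). */
    static final class Page {
        final String url;
        final String batchId; // null when the page is not part of any batch

        Page(String url, String batchId) {
            this.url = url;
            this.batchId = batchId;
        }
    }

    /** Hypothetical in-memory store that counts the rows it hands to the caller. */
    static final class Store {
        private final List<Page> rows = new ArrayList<>();
        int rowsDelivered = 0;

        void put(String url, String batchId) {
            rows.add(new Page(url, batchId));
        }

        /** Scans all rows but delivers only those accepted by the store-side filter. */
        List<Page> scan(Predicate<Page> storeSideFilter) {
            List<Page> delivered = new ArrayList<>();
            for (Page row : rows) {
                if (storeSideFilter.test(row)) {
                    delivered.add(row);
                }
            }
            rowsDelivered += delivered.size();
            return delivered;
        }
    }

    /** Every row is delivered to the job, which then checks batchId itself. */
    static List<Page> selectAfterFullScan(Store store, String batchId) {
        List<Page> selected = new ArrayList<>();
        for (Page page : store.scan(row -> true)) {
            if (batchId.equals(page.batchId)) {
                selected.add(page);
            }
        }
        return selected;
    }

    /** The batchId test is part of the scan, so other rows are skipped early. */
    static List<Page> selectWithScanFilter(Store store, String batchId) {
        return store.scan(row -> batchId.equals(row.batchId));
    }

    /** Small made-up data set: two batches plus two pages without a batch. */
    static Store sampleStore() {
        Store store = new Store();
        store.put("http://example.org/a", "batch-1");
        store.put("http://example.org/b", "batch-2");
        store.put("http://example.org/c", null);
        store.put("http://example.org/d", "batch-1");
        store.put("http://example.org/e", "batch-2");
        store.put("http://example.org/f", null);
        return store;
    }

    public static void main(String[] args) {
        Store unfiltered = sampleStore();
        Store filtered = sampleStore();
        int a = selectAfterFullScan(unfiltered, "batch-1").size();
        int b = selectWithScanFilter(filtered, "batch-1").size();
        System.out.println("full scan: delivered=" + unfiltered.rowsDelivered + " selected=" + a);
        System.out.println("scan filter: delivered=" + filtered.rowsDelivered + " selected=" + b);
    }
}
```

Both methods select the same pages; the difference is how many rows leave the store. In a real deployment the gain depends on whether the backend can evaluate the filter on the server side.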


