Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 17:05:06 UTC

[jira] [Closed] (NUTCH-87) Efficient site-specific crawling for a large number of sites

     [ https://issues.apache.org/jira/browse/NUTCH-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-87.
------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>                 Key: NUTCH-87
>                 URL: https://issues.apache.org/jira/browse/NUTCH-87
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.7.2, 0.8
>         Environment: cross-platform
>            Reporter: AJ Chen
>         Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, build.xml.patch-0.8, urlfilter-whitelist.tar.gz
>
>
> There is a gap between whole-web crawling and crawling a single site (or a handful of sites). Many applications fall into this gap: they need to crawl a large number of selected sites, say 100,000 domains. The current CrawlTool is designed for a handful of sites. This request therefore calls for a new feature in, or an improvement to, CrawlTool so that the "nutch crawl" command can deal efficiently with a large number of sites. One requirement is to add or change as little code as possible so that the feature can be implemented sooner rather than later.
> Adding a URLFilter to implement this feature is discussed in the following thread:
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hash table in the URLFilter to look up the regex for any given domain. A hash table lookup will be much faster than the list implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented this idea before for his own application and is willing to make it available for adaptation to Nutch. I'll be happy to help him with this.
> Before we do this, we would like to hear more discussion of, and comments on, this approach or other approaches. In particular, let us know what the potential downsides of a hash table lookup in a new URLFilter plugin would be.
> AJ Chen
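
The hash table lookup described in the quoted request can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached whitelist filter and not Matt Kangas's implementation: the class name DomainRegexFilter and the allow() method are hypothetical, the sketch is keyed on the exact host name rather than on a registered domain, and it does not implement any Nutch plugin interface. It only follows the filter convention of returning the URL when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: one path regex per whitelisted host, found by hash lookup. */
public class DomainRegexFilter {

    // Host name (lower case) -> compiled regex applied to the URL path.
    private final Map<String, Pattern> rulesByHost = new HashMap<>();

    /** Registers a host and the regex its paths must match (assumed rule format). */
    public void allow(String host, String pathRegex) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT), Pattern.compile(pathRegex));
    }

    /** Returns the URL if its host is whitelisted and its path matches, else null. */
    public String filter(String url) {
        String host;
        String path;
        try {
            URI uri = new URI(url);
            host = uri.getHost();
            path = uri.getRawPath();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        if (host == null) {
            return null;
        }
        // Single hash lookup instead of scanning a list of regexes for every URL.
        Pattern rule = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (rule == null) {
            return null; // host is not on the whitelist
        }
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return rule.matcher(path).find() ? url : null;
    }
}
```

With this layout the cost per URL is one hash lookup plus at most one regex match, regardless of how many sites are registered. The trade-off is that every rule is held in memory and that subdomains need their own entries unless the key is normalized to a registered domain.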

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira