Posted to dev@nutch.apache.org by "Matt Kangas (JIRA)" <ji...@apache.org> on 2005/09/11 07:28:32 UTC

[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

    [ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323157 ] 

Matt Kangas commented on NUTCH-87:
----------------------------------

Sample edits to nutch-site.xml for use with this plugin:


<property>
  <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
  <value>false</value>
</property>

<property>
  <name>urlfilter.whitelist.file</name>
  <value>/var/epile/crawl/whitelist_map</value>
  <description>Name of file containing the location of the on-disk whitelist map directory.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value> 
</property>
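
The central idea discussed below, a hashtable keyed by domain that maps to a per-domain regex, can be sketched roughly as follows. This is a minimal standalone illustration of the lookup strategy, not the actual plugin code; the class and method names here are invented for the example, and the real WhitelistURLFilter reads its map from the on-disk directory configured above.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch: O(1) host lookup into a map of compiled patterns,
// instead of scanning a flat regex list as RegexURLFilter does.
public class WhitelistSketch {
    // host -> allowed-path pattern
    private final Map<String, Pattern> whitelist = new HashMap<String, Pattern>();

    public void allow(String host, String pathRegex) {
        whitelist.put(host, Pattern.compile(pathRegex));
    }

    /** Returns the URL unchanged if its host is whitelisted and its
     *  path matches that host's pattern; otherwise returns null
     *  (the URLFilter convention for "rejected"). */
    public String filter(String urlString) {
        try {
            URL url = new URL(urlString);
            Pattern p = whitelist.get(url.getHost());
            if (p != null && p.matcher(url.getPath()).matches()) {
                return urlString;
            }
        } catch (Exception e) {
            // malformed URL: reject
        }
        return null;
    }

    public static void main(String[] args) {
        WhitelistSketch f = new WhitelistSketch();
        f.allow("example.com", "/docs/.*");
        System.out.println(f.filter("http://example.com/docs/page.html"));
        System.out.println(f.filter("http://other.org/docs/page.html"));
    }
}
```

With 100,000 domains, each filter call costs one hash lookup plus one regex match against a single pattern, rather than a pass over the whole pattern list.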


> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a handful). Many applications fall into this gap and typically require crawling a large number of selected sites, say 100,000 domains. The current CrawlTool is designed for a handful of sites, so this request calls for a new feature or an improvement to CrawlTool so that the "nutch crawl" command can efficiently handle a large number of sites. One requirement is to add or change as little code as possible so that this feature can be implemented sooner rather than later.
> There is a discussion about adding a URLFilter to implement this requested feature; see the following thread:
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hashtable in the URLFilter to look up the regex for any given domain. A hashtable will be much faster than the list implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented this idea before for his own application and is willing to make it available for adaptation to Nutch. I'll be happy to help him in this regard.
> But before we do, we would like to hear more discussion or comments about this approach or alternatives. In particular, let us know what the potential downsides of a hashtable lookup in a new URLFilter plugin would be.
> AJ Chen

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira