Posted to user@nutch.apache.org by John Funke <fu...@gmail.com> on 2008/01/29 03:15:50 UTC

trying to perform an intentionally slow crawl - fetcher.server.delay ignored?

For the sake of politeness, I am trying to run an intentionally slow crawl
against one of our internal servers by setting the fetcher.server.delay
value to 20, but no matter what I change this value to, it continues to
fetch at the same speed. I am running the latest stable version, 0.9, and
have set threads to 1.

Am I doing something wrong? Also, is there a way to do this on a
host-specific basis while fetching from other hosts at the default speed?

I notice in my hadoop.log (see below) it says "fetcher.server.delay = 1000"
regardless of what I set fetcher.server.delay to...

Thanks!

*** nutch-default.xml excerpt ***
<property>
  <name>fetcher.server.delay</name>
  <value>20.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>
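
For what it's worth, here is the same override as I would expect it to look
in conf/nutch-site.xml (just a sketch on my part, assuming the usual
layering where values in nutch-site.xml take precedence over
nutch-default.xml):

<property>
  <name>fetcher.server.delay</name>
  <value>20.0</value>
  <description>Local override: seconds to wait between successive
   requests to the same server.</description>
</property>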

*** crawl output ***
$ bin/nutch crawl urls -dir crawl -depth 3

crawl started in: crawl
rootUrlDir = urls
threads = 1
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080129015808
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080129015808
Fetcher: threads: 1
fetching http://mysite.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080129015808]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080129015818
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080129015818
Fetcher: threads: 1
fetching http://mysite.com/program/
fetching http://mysite.com/report/rep2007.html
fetching http://mysite.com/program/stage5.html
fetching http://mysite.com/contact.html
fetching http://mysite.com/report/
.
.
...and so on ...


*** hadoop.log excerpt: ***

2008-01-29 01:58:12,316 INFO  plugin.PluginRepository - Plugins: looking in: /home/nutch/server/nutch/plugins
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository - Registered Plugins:
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
2008-01-29 01:58:12,674 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-01-29 01:58:12,675 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-01-29 01:58:12,739 INFO  fetcher.Fetcher - fetching http://mysite.com/
2008-01-29 01:58:12,752 INFO  http.Http - http.proxy.host = null
2008-01-29 01:58:12,752 INFO  http.Http - http.proxy.port = 8080
2008-01-29 01:58:12,752 INFO  http.Http - http.timeout = 10000
2008-01-29 01:58:12,752 INFO  http.Http - http.content.limit = 65536
2008-01-29 01:58:12,752 INFO  http.Http - http.agent = FLA Spider/Nutch-0.9
2008-01-29 01:58:12,752 INFO  http.Http - protocol.plugin.check.blocking = true
2008-01-29 01:58:12,756 INFO  http.Http - protocol.plugin.check.robots = true
2008-01-29 01:58:12,756 INFO  http.Http - fetcher.server.delay = 1000
2008-01-29 01:58:12,756 INFO  http.Http - http.max.delays = 1000
2008-01-29 01:58:13,257 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2008-01-29 01:58:13,563 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
.
.
...and so on...


Re: trying to perform an intentionally slow crawl - fetcher.server.delay ignored?

Posted by Andrzej Bialecki <ab...@getopt.org>.
John Funke wrote:
> For the sake of politeness, I am trying to run an intentionally slow crawl
> against one of our internal servers by setting the fetcher.server.delay
> value to 20, but no matter what I change this value to, it continues to
> fetch at the same speed. I am running the latest stable version, 0.9, and
> have set threads to 1.

Please check your /robots.txt - fetcher.server.delay is just the initial 
value. If your robots.txt specifies another value, that value will be 
used instead (see Fetcher2:520).
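
For example, something along these lines in that host's robots.txt (the
5-second figure is only an illustration) is what would take the place of
the configured delay; and since robots.txt is read per host, it also gives
you the host-specific control you asked about, while other hosts keep the
default:

User-agent: *
Crawl-delay: 5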


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com