Posted to user@nutch.apache.org by John Funke <fu...@gmail.com> on 2008/01/29 03:15:50 UTC
trying to perform an intentionally slow crawl - fetcher.server.delay ignored?
For the sake of politeness, I am trying to run an intentionally slow crawl
against one of our internal servers by setting the fetcher.server.delay
value to 20, but no matter what I change this value to, it continues to
fetch at the same speed. I am running the latest stable version, 0.9, and
have set the thread count to 1.
Am I doing something wrong? Also, is there a way to do this on a
host-specific basis while fetching from other hosts at the default speed?
I notice in my hadoop.log (excerpt below) that it says
"fetcher.server.delay = 1000" regardless of what I set
fetcher.server.delay to...
Thanks!
*** nutch-default.xml excerpt ***
<property>
<name>fetcher.server.delay</name>
<value>20.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
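For reference, the excerpt above is from nutch-default.xml; Nutch loads the
defaults first and then applies conf/nutch-site.xml on top of them, so an
override placed in nutch-site.xml takes precedence. A minimal sketch of such
an override, reusing the property name and 20-second value from the excerpt:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides nutch-default.xml; seconds the fetcher waits
       between successive requests to the same server. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>20.0</value>
  </property>
</configuration>
```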
*** crawl output ***
$ bin/nutch crawl urls -dir crawl -depth 3
crawl started in: crawl
rootUrlDir = urls
threads = 1
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080129015808
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080129015808
Fetcher: threads: 1
fetching http://mysite.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080129015808]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080129015818
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080129015818
Fetcher: threads: 1
fetching http://mysite.com/program/
fetching http://mysite.com/report/rep2007.html
fetching http://mysite.com/program/stage5.html
fetching http://mysite.com/contact.html
fetching http://mysite.com/report/
.
.
...and so on ...
*** hadoop.log excerpt: ***
2008-01-29 01:58:12,316 INFO plugin.PluginRepository - Plugins: looking in: /home/nutch/server/nutch/plugins
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Registered Plugins:
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2008-01-29 01:58:12,674 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Registered Extension-Points:
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-01-29 01:58:12,675 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-01-29 01:58:12,739 INFO fetcher.Fetcher - fetching http://mysite.com/
2008-01-29 01:58:12,752 INFO http.Http - http.proxy.host = null
2008-01-29 01:58:12,752 INFO http.Http - http.proxy.port = 8080
2008-01-29 01:58:12,752 INFO http.Http - http.timeout = 10000
2008-01-29 01:58:12,752 INFO http.Http - http.content.limit = 65536
2008-01-29 01:58:12,752 INFO http.Http - http.agent = FLA Spider/Nutch-0.9
2008-01-29 01:58:12,752 INFO http.Http - protocol.plugin.check.blocking = true
2008-01-29 01:58:12,756 INFO http.Http - protocol.plugin.check.robots = true
2008-01-29 01:58:12,756 INFO http.Http - fetcher.server.delay = 1000
2008-01-29 01:58:12,756 INFO http.Http - http.max.delays = 1000
2008-01-29 01:58:13,257 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2008-01-29 01:58:13,563 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
.
.
...and so on...
Re: trying to perform an intentionally slow crawl - fetcher.server.delay
ignored?
Posted by Andrzej Bialecki <ab...@getopt.org>.
John Funke wrote:
> For the sake of politeness, I am trying to run an intentionally slow crawl
> against one of our internal servers by setting the fetcher.server.delay
> value to 20, but no matter what I change this value to, it continues to
> fetch at the same speed. I am running the latest stable version, 0.9, and
> have set the thread count to 1.
Please check your /robots.txt - fetcher.server.delay is just the initial
value. If your robots.txt specifies another value, that value will be
used instead (see Fetcher2:520).
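That is, if the server's robots.txt contains a Crawl-delay directive, the
fetcher honors that instead of fetcher.server.delay. A sketch of what to
look for there (the 5-second figure is only an illustration):

```
User-agent: *
Crawl-delay: 5
```

This is also one way to get the per-host behavior asked about above: each
host's robots.txt can carry its own Crawl-delay, while hosts without one
fall back to the configured default.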
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com