Posted to user@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/06/22 00:24:05 UTC
Subject: fetching http://www.variety.com/
I'm new to nutch and attempting a few simple tests in preparation for some major crawling work.
My current test is to crawl www.variety.com to a depth of 2.
I have set things up as I'm supposed to, but I get the following in my crawl output:
fetching http://www.variety.com/RSS.asp
fetching http://www.variety.com/boxoffice
fetching http://www.variety.com/pilotwatch2007
fetching http://www.variety.com/
fetching http://www.variety.com/review/VE1117933968
fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
fetching http://www.variety.com/review/VE1117933972
fetching http://www.variety.com/article/VR1117967371
fetching http://www.variety.com/
Is Nutch seriously broken? Why is it trying to fetch those two URLs with the embedded HTML?
Details:
I'm running Nutch 0.9 (the stable download from April 2007) on BSD with Java 1.5.
I invoke Nutch as follows:
nutch crawl urls.txt -dir mydir -depth 2 2>&1 | tee crawl.log
urls.txt contains this:
http://www.variety.com
crawl-urlfilter.txt contains this:
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://www.variety.com
# skip everything else
-.
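(Aside: the filter rules above can be exercised outside Nutch with a quick sketch. This uses plain Python `re` rather than Nutch's own Java-based RegexURLFilter, so treat it as an approximation of the "first matching pattern wins" behavior, not the real plugin:)

```python
import re

# (pattern, accept?) pairs mirroring crawl-urlfilter.txt, in file order.
RULES = [
    (re.compile(r'^(file|ftp|mailto):'), False),
    (re.compile(r'\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip'
                r'|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$'), False),
    (re.compile(r'[?*!@=]'), False),
    (re.compile(r'.*(/.+?)/.*?\1/.*?\1/'), False),
    (re.compile(r'^http://www.variety.com'), True),
    (re.compile(r'.'), False),
]

def accept(url):
    # The first pattern that matches anywhere in the URL decides;
    # if nothing matches, the URL is ignored.
    for pattern, keep in RULES:
        if pattern.search(url):
            return keep
    return False

# accept('http://www.variety.com/boxoffice')  -> accepted by the '+' rule
# accept('http://www.variety.com/RSS.asp?p=1') -> rejected by the [?*!@=] rule
```

One thing this makes easy to see: the dots in `+^http://www.variety.com` are unescaped, so they match any character (e.g. `http://wwwXvarietyXcom` would also be accepted); `+^http://www\.variety\.com` would be the stricter form.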
I have the following in nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>testbed-random</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>crawler v0.9</value>
<description>Further description of our bot - this text is used in
the User-Agent header. It appears in parentheses after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://hopoo.dyndns.org</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>kai(underscore)testing(att)yahoo(dotcom)</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
crawl.log follows:
crawl started in: mydir
rootUrlDir = urls.txt
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: mydir/crawldb
Injector: urlDir: urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070621145957
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070621145957
Fetcher: threads: 10
fetching http://www.variety.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070621145957]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070621150010
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070621150010
Fetcher: threads: 10
fetching http://www.variety.com/RSS.asp
fetching http://www.variety.com/boxoffice
fetching http://www.variety.com/pilotwatch2007
fetching http://www.variety.com/
fetching http://www.variety.com/review/VE1117933968
fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
fetching http://www.variety.com/review/VE1117933972
fetching http://www.variety.com/article/VR1117967371
fetching http://www.variety.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070621150010]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: mydir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: mydir/segments/20070621145957
LinkDb: adding segment: mydir/segments/20070621150010
LinkDb: done
Indexer: starting
Indexer: linkdb: mydir/linkdb
Indexer: adding segment: mydir/segments/20070621145957
Indexer: adding segment: mydir/segments/20070621150010
Indexing [http://www.variety.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1d2b9b7 (null)
Indexing [http://www.variety.com/pilotwatch2007] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1d2b9b7 (null)
Optimizing index.
merging segments _ram_0 (1 docs) _ram_1 (1 docs) into _0 (2 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: mydir/indexes
Dedup: done
merging indexes to: mydir/index
Adding mydir/indexes/part-00000
done merging
crawl finished: mydir