Posted to user@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/06/22 00:24:05 UTC
Subject: fetching http://www.variety.com/
I'm new to nutch and attempting a few simple tests in preparation for some major crawling work.
My current test is to crawl www.variety.com to a depth of 2.
I have set things up as I'm supposed to, but I get the following in my crawl output:
fetching http://www.variety.com/RSS.asp
fetching http://www.variety.com/boxoffice
fetching http://www.variety.com/pilotwatch2007
fetching http://www.variety.com/
fetching http://www.variety.com/review/VE1117933968
fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
fetching http://www.variety.com/review/VE1117933972
fetching http://www.variety.com/article/VR1117967371
fetching http://www.variety.com/
Is Nutch seriously broken? Why is it trying to fetch those two URLs with the embedded HTML?
Details:
I'm running Nutch 0.9 (the stable download from April 2007) on BSD with Java 1.5.
I invoke Nutch as follows:
nutch crawl urls.txt -dir mydir -depth 2 2>&1 | tee crawl.log
urls.txt contains this:
http://www.variety.com
crawl-urlfilter.txt contains this:
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://www.variety.com
# skip everything else
-.
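(Aside: the filter rules above can be exercised outside Nutch with a quick sketch. This uses plain Python `re` rather than Nutch's own Java-based RegexURLFilter, so treat it as an approximation of the "first matching pattern wins" behavior, not the real plugin:)

```python
import re

# (pattern, accept?) pairs mirroring crawl-urlfilter.txt, in file order.
RULES = [
    (re.compile(r'^(file|ftp|mailto):'), False),
    (re.compile(r'\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip'
                r'|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$'), False),
    (re.compile(r'[?*!@=]'), False),
    (re.compile(r'.*(/.+?)/.*?\1/.*?\1/'), False),
    (re.compile(r'^http://www.variety.com'), True),
    (re.compile(r'.'), False),
]

def accept(url):
    # The first pattern that matches anywhere in the URL decides;
    # if nothing matches, the URL is ignored.
    for pattern, keep in RULES:
        if pattern.search(url):
            return keep
    return False

# accept('http://www.variety.com/boxoffice')  -> accepted by the '+' rule
# accept('http://www.variety.com/RSS.asp?p=1') -> rejected by the [?*!@=] rule
```

One thing this makes easy to see: the dots in `+^http://www.variety.com` are unescaped, so they match any character (e.g. `http://wwwXvarietyXcom` would also be accepted); `+^http://www\.variety\.com` would be the stricter form.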
I have the following in nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>testbed-random</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>crawler v0.9</value>
<description>Further description of our bot - this text is used in
the User-Agent header. It appears in parentheses after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://hopoo.dyndns.org</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>kai(underscore)testing(att)yahoo(dotcom)</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
crawl.log follows:
crawl started in: mydir
rootUrlDir = urls.txt
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: mydir/crawldb
Injector: urlDir: urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070621145957
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070621145957
Fetcher: threads: 10
fetching http://www.variety.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070621145957]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: mydir/segments/20070621150010
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: mydir/segments/20070621150010
Fetcher: threads: 10
fetching http://www.variety.com/RSS.asp
fetching http://www.variety.com/boxoffice
fetching http://www.variety.com/pilotwatch2007
fetching http://www.variety.com/
fetching http://www.variety.com/review/VE1117933968
fetching http://www.variety.com/graphics/marketing/siriussurvey07_6.html
fetching http://www.variety.com/review/VE1117933972
fetching http://www.variety.com/article/VR1117967371
fetching http://www.variety.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: mydir/crawldb
CrawlDb update: segments: [mydir/segments/20070621150010]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: mydir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: mydir/segments/20070621145957
LinkDb: adding segment: mydir/segments/20070621150010
LinkDb: done
Indexer: starting
Indexer: linkdb: mydir/linkdb
Indexer: adding segment: mydir/segments/20070621145957
Indexer: adding segment: mydir/segments/20070621150010
Indexing [http://www.variety.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1d2b9b7 (null)
Indexing [http://www.variety.com/pilotwatch2007] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1d2b9b7 (null)
Optimizing index.
merging segments _ram_0 (1 docs) _ram_1 (1 docs) into _0 (2 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: mydir/indexes
Dedup: done
merging indexes to: mydir/index
Adding mydir/indexes/part-00000
done merging
crawl finished: mydir