Posted to user@nutch.apache.org by toabhishek16 <to...@gmail.com> on 2007/12/19 09:32:38 UTC

Error in running Nutch 0.9

Hi all,
I am using Nutch 0.9, and I followed all the instructions given in the Nutch
0.8 tutorial on the Nutch site. I am behind a proxy (192.168.81.100). Below is
the nutch-default.xml I used to configure Nutch.

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>CDACPune</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>cdacp</value>
  <description>Further description of our bot; this text is used in
  the User-Agent header.  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.cdac.in</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parentheses after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>asoni@cdac.in</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
   header.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>http.proxy.host</name>
  <value>192.168.81.100</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>8080</value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs; instead it will record them for later fetching.
  </description>
</property>
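
As an aside: I edited nutch-default.xml directly, but I understand that local
overrides are normally supposed to go in conf/nutch-site.xml, since
nutch-default.xml only documents the shipped defaults. If it matters, here is
a rough sketch of how the proxy overrides would look there; the property names
are the same ones shown above, and the heredoc form is only for illustration:

# sketch: keep local overrides in conf/nutch-site.xml instead of
# editing nutch-default.xml (run from the nutch-0.9 directory)
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>CDACPune</value>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>192.168.81.100</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>
</configuration>
EOF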




Below is my crawl-urlfilter.txt:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in apache.org
+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
-.
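
To double-check that the accept rule actually matches my seed URL, I also ran
a quick shell test (grep -E rather than the java.util.regex engine Nutch uses,
but for this simple pattern they should agree):

# sanity check: the accept rule should match and print the seed URL
echo 'http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html' \
  | grep -E '^http://([a-z0-9]*\.)*apache.org/'

This prints the URL, so the filter itself does not look like the problem.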


And the seed file in my urls directory contains:

http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html

While trying to crawl the web (i.e. inject, generate, fetch, updatedb,
invertlinks, index), I get the following messages and am unable to crawl:

[root@localhost nutch-0.9]# bin/nutch inject crawl/crawldb urls
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
[root@localhost nutch-0.9]# bin/nutch generate crawl/crawldb crawl/segments
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20071219121836
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
[root@localhost nutch-0.9]# s1=`ls -d crawl/segments/2* | tail -1`
[root@localhost nutch-0.9]# bin/nutch fetch $s1
Fetcher: starting
Fetcher: segment: crawl/segments/20071219121836
Fetcher: threads: 10
fetching http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html
fetch of http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: apr.apache.org: apr.apache.org
Fetcher: done
[root@localhost nutch-0.9]# bin/nutch updatedb crawl/crawldb $s1
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20071219121836]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
[root@localhost nutch-0.9]# bin/nutch generate crawl/crawldb crawl/segments -topN 1000
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20071219121929
Generator: filtering: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
[root@localhost nutch-0.9]# s2=`ls -d crawl/segments/2* | tail -1`
[root@localhost nutch-0.9]# bin/nutch fetch $s2
Fetcher: starting
Fetcher: segment: crawl/segments/20071219121929
Fetcher: threads: 10
fetching http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html
fetch of http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: apr.apache.org: apr.apache.org
Fetcher: done
[root@localhost nutch-0.9]# bin/nutch updatedb crawl/crawldb $s2
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20071219121929]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
[root@localhost nutch-0.9]# bin/nutch generate crawl/crawldb crawl/segments -topN 1000
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20071219122019
Generator: filtering: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
[root@localhost nutch-0.9]# s3=`ls -d crawl/segments/2* | tail -1`
[root@localhost nutch-0.9]# bin/nutch fetch $s3
Fetcher: starting
Fetcher: segment: crawl/segments/20071219122019
Fetcher: threads: 10
fetching http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html
fetch of http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: apr.apache.org: apr.apache.org
Fetcher: done
[root@localhost nutch-0.9]# bin/nutch updatedb crawl/crawldb $s3
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20071219122019]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
[root@localhost nutch-0.9]# bin/nutch invertlinks crawl/linkdb crawl/segments/*
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20071219121836
LinkDb: adding segment: crawl/segments/20071219121929
LinkDb: adding segment: crawl/segments/20071219122019
LinkDb: done
[root@localhost nutch-0.9]# bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20071219121836
Indexer: adding segment: crawl/segments/20071219121929
Indexer: adding segment: crawl/segments/20071219122019
Optimizing index.
Indexer: done
[root@localhost nutch-0.9]# bin/nutch org.apache.nutch.searcher.NutchBean apache
Total hits: 0
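
My guess is that the UnknownHostException means the fetcher is trying to
resolve apr.apache.org itself instead of handing the request to the proxy, so
either the machine has no direct DNS (likely, behind this proxy) or my proxy
settings are not being picked up at all. A sanity check I can think of, run
from the same machine (proxy address as configured above; host can be replaced
by nslookup):

# does the machine resolve the host at all? (probably not, behind the proxy)
host apr.apache.org

# can the proxy fetch the page on our behalf? wget honors the
# http_proxy environment variable
http_proxy=http://192.168.81.100:8080 wget -O /dev/null \
  http://apr.apache.org/docs/apr/1.2/structapr__getopt__option__t.html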



Since every fetch fails with the same UnknownHostException, I assume nothing
was ever fetched or indexed, which would explain the zero hits. Can anyone
help me solve this problem? Thanks in advance.
