Posted to user@nutch.apache.org by Paul van Hoven <pa...@googlemail.com> on 2011/07/07 14:17:25 UTC

Problems with nutch tutorial

I'm completely new to Nutch, so I downloaded version 1.3 and worked
through the beginners' tutorial at
http://wiki.apache.org/nutch/NutchTutorial. The first problem was that I
could not find the file "conf/crawl-urlfilter.txt", so I skipped that step
and continued with launching Nutch. I created a plain text file
in "/Users/toom/Downloads/nutch-1.3/crawled" called "urls.txt", which
contains the following text:

tom:crawled toom$ cat urls.txt
http://nutch.apache.org/

After that I invoked Nutch by calling:
tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
solrUrl is not set, indexing will be skipped...
crawl started in: /Users/toom/Downloads/nutch-1.3/sites
rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2011-07-07 14:02:31
Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-07 14:02:35, elapsed: 00:00:03
Generator: starting at 2011-07-07 14:02:35
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
Generator: finished at 2011-07-07 14:02:39, elapsed: 00:00:04
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166)
     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:135)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)


I do not understand what happened here. Maybe one of you can help me?

Re: Problems with nutch tutorial

Posted by Markus Jelsma <ma...@openindex.io>.

Hi 


You didn't follow bullet two of this part of the tutorial:

-----

Good! You are almost ready to crawl. You need to give your crawler a name.
This is required.

   1. Open up $NUTCH_HOME/conf/nutch-default.xml file
   2. Search for http.agent.name, and give it value 'YOURNAME Spider'
   3. Optionally you may also set http.agent.url and http.agent.email
      properties.

-----

Actually, we recommend not modifying nutch-default.xml; copy the properties
you need to nutch-site.xml instead.
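
As a rough illustration of that recommendation, a minimal nutch-site.xml
that overrides only http.agent.name might look like the sketch below. The
value 'YOURNAME Spider' is the placeholder from the tutorial, and the file
is assumed to live in $NUTCH_HOME/conf next to nutch-default.xml.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml: properties set here override the
     defaults in nutch-default.xml. Only the agent name is set. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder value taken from the tutorial; replace with your own. -->
    <value>YOURNAME Spider</value>
  </property>
</configuration>
```

With a non-empty http.agent.name in place, re-running the same crawl command
should get past the Fetcher's configuration check that raised the exception.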



-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Problems with nutch tutorial

Posted by Nutch User - 1 <nu...@gmail.com>.

This seems trivial. From the tutorial:

"
Good! You are almost ready to crawl. You need to give your crawler a
name. This is required.

   1. Open up $NUTCH_HOME/conf/nutch-default.xml file
   2. Search for http.agent.name, and give it value 'YOURNAME Spider'
   3. Optionally you may also set http.agent.url and http.agent.email
      properties
"

Re: Problems with nutch tutorial

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Paul,

Please see this tutorial for working with Nutch 1.3 [1].

From memory, the tutorial you were using is for Nutch 1.2.

[1] http://wiki.apache.org/nutch/RunningNutchAndSolr

Thank you




-- 
*Lewis*