Posted to user@nutch.apache.org by kkfromus <kk...@gmail.com> on 2007/03/19 05:31:27 UTC

Nutch 0.8.1 issue with fetch

I am a new user trying to configure Nutch and am running into some issues. I
would appreciate it if someone could help.

I am running Nutch under Cygwin and trying to crawl the web site given in
the tutorial.

I have a urls directory, and under it a url.text file with the entry
http://msn.com
I modified crawl-urlfilter.txt to use the right domain:
+^http://([a-z0-9]*\.)*msn.com/
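
A quick way to see which URLs this rule lets through is to try the pattern on a few samples. The sketch below uses Python's re module as a stand-in for the Java regex engine Nutch uses, and it assumes the filter accepts a URL when the pattern is found anywhere in it; the helper name accepted and the sample URLs are made up for illustration. Nutch may also normalize URLs before filtering, which this sketch does not model.

```python
import re

# Rule copied from the post. The leading "+" is the include marker used in
# crawl-urlfilter.txt, so only the remainder is the regular expression.
rule = r"+^http://([a-z0-9]*\.)*msn.com/"
sign, pattern = rule[0], rule[1:]


def accepted(url: str) -> bool:
    """Return True if the include rule's pattern is found in the URL.

    Hypothetical helper: assumes "find anywhere" semantics for the filter.
    """
    return sign == "+" and re.search(pattern, url) is not None


samples = [
    "http://www.msn.com/",   # subdomain plus trailing slash
    "http://msn.com/",       # bare domain plus trailing slash
    "http://msn.com",        # seed as written in the post, no trailing slash
    "http://www.msnXcom/",   # the dot in "msn.com" is unescaped
]

for url in samples:
    print(f"{url:25} accepted={accepted(url)}")
```

Two things stand out: the pattern requires a trailing slash after the host, and the unescaped dot in msn.com matches any character, so writing msn\.com would be stricter.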

When I run Nutch using ./nutch crawl urls -dir c:/nutch/crawl -depth 5 -topN
50, it runs, but during the fetch I see an error:

fetching http://www.msn.com/
fetch of http://www.msn.com/ failed with: java.lang.NullPointerException
Fetcher: done

I looked at the readdb stats, but they say there is only one page.

I searched for msn through the Tomcat search page and got no results.
Can someone please help?

Thanks
Kiran


Here is the full set of logs
$ ./nutch crawl urls -dir c:/nutch/crawl -depth 5 -topN 50
crawl started in: c:/nutch/crawl
rootUrlDir = urls
threads = 10
depth = 5
topN = 50
Injector: starting
Injector: crawlDb: c:/nutch/crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: c:/nutch/crawl/segments/20070318212902
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: c:/nutch/crawl/segments/20070318212902
Fetcher: threads: 10
fetching http://www.msn.com/
fetch of http://www.msn.com/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: c:/nutch/crawl/crawldb
CrawlDb update: segment: c:/nutch/crawl/segments/20070318212902
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: c:/nutch/crawl/segments/20070318212912
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: c:/nutch/crawl/segments/20070318212912
Fetcher: threads: 10
fetching http://www.msn.com/
fetch of http://www.msn.com/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: c:/nutch/crawl/crawldb
CrawlDb update: segment: c:/nutch/crawl/segments/20070318212912
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: c:/nutch/crawl/segments/20070318212920
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: c:/nutch/crawl/segments/20070318212920
Fetcher: threads: 10
fetching http://www.msn.com/
fetch of http://www.msn.com/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: c:/nutch/crawl/crawldb
CrawlDb update: segment: c:/nutch/crawl/segments/20070318212920
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: c:/nutch/crawl/segments/20070318212928
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: c:/nutch/crawl/segments/20070318212928
Fetcher: threads: 10
fetching http://www.msn.com/
fetch of http://www.msn.com/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: c:/nutch/crawl/crawldb
CrawlDb update: segment: c:/nutch/crawl/segments/20070318212928
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: c:/nutch/crawl/segments/20070318212936
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: c:/nutch/crawl/segments/20070318212936
Fetcher: threads: 10
fetching http://www.msn.com/
fetch of http://www.msn.com/ failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: c:/nutch/crawl/crawldb
CrawlDb update: segment: c:/nutch/crawl/segments/20070318212936
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: c:/nutch/crawl/linkdb
LinkDb: adding segment: c:/nutch/crawl/segments/20070318212902
LinkDb: adding segment: c:/nutch/crawl/segments/20070318212912
LinkDb: adding segment: c:/nutch/crawl/segments/20070318212920
LinkDb: adding segment: c:/nutch/crawl/segments/20070318212928
LinkDb: adding segment: c:/nutch/crawl/segments/20070318212936
LinkDb: done
Indexer: starting
Indexer: linkdb: c:/nutch/crawl/linkdb
Indexer: adding segment: c:/nutch/crawl/segments/20070318212902
Indexer: adding segment: c:/nutch/crawl/segments/20070318212912
Indexer: adding segment: c:/nutch/crawl/segments/20070318212920
Indexer: adding segment: c:/nutch/crawl/segments/20070318212928
Indexer: adding segment: c:/nutch/crawl/segments/20070318212936
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: c:/nutch/crawl/indexes
Dedup: done
Adding c:/nutch/crawl/indexes/part-00000
crawl finished: c:/nutch/crawl





-- 
View this message in context: http://www.nabble.com/Nutch-0.8.1-issue-with-fetch-tf3425056.html#a9546446
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch 0.8.1 issue with fetch

Posted by "Ratnesh,V2Solutions India" <ra...@in.v2solutions.com>.
First, your problem is not very well defined,
but my guess is that the NullPointerException occurs because the crawler is not
able to crawl the pages. Check whether the URL you mentioned is
correct, and then, in crawl-urlfilter.txt, try giving
+^http://www.msn.com as the whole path.

Also let me know whether you have configured the agent and robots settings in
nutch-default.xml.

thanks





Re: Nutch 0.8.1 issue with fetch

Posted by kkfromus <kk...@gmail.com>.
Never mind the spam. The problem was that the property value for the
agent name was empty. With a value set, the property looks like this:

<property>
  <name>http.agent.name</name>
  <value>TestNutch</value>
  <description>Spareagent
  </description>
</property>
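
For anyone hitting the same NullPointerException, a small check like the one below can confirm whether the agent name is actually set before starting a crawl. This is only a sketch: the SAMPLE text is a made-up stand-in for a Nutch configuration file such as nutch-default.xml, and the helper names property_value and agent_name_is_set are hypothetical, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Hypothetical config text for illustration; in practice this would be the
# contents of the Nutch configuration file being used.
SAMPLE = """<configuration>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>Spareagent
  </description>
</property>
</configuration>"""


def property_value(xml_text, name):
    """Return the stripped value of the named property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            # An empty <value></value> element yields an empty string.
            return (prop.findtext("value") or "").strip()
    return None


def agent_name_is_set(xml_text):
    """True only when http.agent.name exists and is non-empty."""
    return bool(property_value(xml_text, "http.agent.name"))


if __name__ == "__main__":
    fixed = SAMPLE.replace("<value></value>", "<value>TestNutch</value>")
    print("empty value set? ", agent_name_is_set(SAMPLE))
    print("fixed value set? ", agent_name_is_set(fixed))
```

To check a real install, read the file's text and pass it to agent_name_is_set in place of SAMPLE.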


