Posted to user@nutch.apache.org by Divjot Singh <di...@gmail.com> on 2015/07/13 10:25:52 UTC

Problem with crawling in Nutch 2.3

Hi all

I have compiled the Nutch 2.3 code with Gora 0.6, using Cloudera HBase as the
backend database. The code compiles fine and I am able to run it with the
bin/crawl command. The problem is that after fetching, it does not parse all
of the URLs that were fetched in that phase; it skips some of them.
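
For reference, I invoke the crawl script roughly like this (a sketch; the
seed directory, crawl id, and round count are placeholders for my actual
values, and depending on the script version a Solr URL argument may also be
expected):

  bin/crawl urls/ TestCrawl 3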
Secondly, after three iterations it starts to generate and fetch the same
pages as in the previous runs. In short, it processes the same pages over and
over without picking up any new ones.
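
I check this between rounds by inspecting the HBase-backed web table with
readdb (a sketch; the crawl id is a placeholder and the exact options may
vary by version):

  bin/nutch readdb -stats -crawlId TestCrawl

In my case the total URL count it reports does not grow between rounds.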
I have checked all the configuration settings; some of the relevant ones are
listed below.


  <name>db.update.additions.allowed</name>
  <value>true</value>

  <name>generate.max.count</name>
  <value>-1</value>

  <name>generate.max.distance</name>
  <value>-1</value>

  <name>generate.update.crawldb</name>
  <value>true</value>

  <name>db.fetch.interval.default</name>
  <value>1209600</value>

  <name>db.signature.classes</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>

  <name>db.fetch.interval.max</name>
  <value>7776000</value>

  <name>db.max.outlinks.per.page</name>
  <value>-1</value>

  <name>parser.timeout</name>
  <value>3000</value>

  <name>http.content.limit</name>
  <value>-1</value>
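
All of these are set in conf/nutch-site.xml using the standard property
syntax; for example, the fetch interval entry looks like this (the
description text is my own note):

  <property>
    <name>db.fetch.interval.default</name>
    <value>1209600</value>
    <description>Default number of seconds between re-fetches of a page
    (here 14 days).</description>
  </property>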

So, what am I doing wrong here?

Thanks
Divjot