Posted to user@nutch.apache.org by Divjot Singh <di...@gmail.com> on 2015/07/13 10:25:52 UTC
Problem with crawling for nutch 2.3
Hi all
I have compiled Nutch 2.3 with Gora 0.6, using Cloudera HBase as the
backend database. The code compiles fine and I am able to run it with the
bin/crawl command. The problem is that after fetching, it does not parse
all the URLs that were fetched during that phase and skips them.
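For context, the crawl is launched roughly like this (the seed directory and crawl ID below are placeholders from my setup, not anything special):

```shell
# Nutch 2.x crawl script: bin/crawl <seedDir> <crawlID> <numberOfRounds>
bin/crawl urls/ testCrawl 3
```

I can also run the phases by hand (inject, generate, fetch, parse, updatedb via bin/nutch) and the URLs seem to drop out between the fetch and parse steps.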
Secondly, after 3 iterations it starts to generate and fetch the same
pages it already handled in previous runs. In short, it processes the same
pages again and again without picking up any new pages.
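To check this I look at the WebTable statistics after each run; the status counts stay essentially unchanged (crawl ID is again a placeholder from my setup):

```shell
# Dump status counts for the crawl's WebTable (Nutch 2.x WebTableReader)
bin/nutch readdb -crawlId testCrawl -stats
```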
I have checked all the configuration properties; the relevant ones are listed below.
<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>
<property>
  <name>generate.max.distance</name>
  <value>-1</value>
</property>
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>1209600</value>
</property>
<property>
  <name>db.signature.classes</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
<property>
  <name>parser.timeout</name>
  <value>3000</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
So what am I doing wrong here?
Thanks
Divjot