Posted to user@nutch.apache.org by "shubham.gupta" <sh...@orkash.com> on 2017/05/16 12:09:54 UTC

No. of documents decreasing in 2nd fetch | Nutch 2.3.1 + hadoop 2.7.1 + mongodb

Hey

I have a batch of 5000 seed URLs. I am trying to crawl these URLs using
the Apache Nutch job that is built when the command "ant clean runtime"
is executed.
The first 2 cycles of the Nutch workflow, i.e.
inject->generate->fetch->parse->updatedb, work fine, and around 20,000
URLs are fetched. But after the 2nd cycle, whenever the workflow is
executed again, the number of documents with status 2 in the database
starts to decrease.

For example: after the 2nd cycle, the number of documents with status 2
was 22220, and the total number of links present after updatedb was
75882.

After the 3rd cycle, the number of documents with status 2 decreased to
22209, while the total number of links increased to 78443. I checked the
logs and the job does not report any errors, so I am unable to debug
this. Are there some changes that need to be made in the Nutch
configuration?
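
To make the change between the two cycles explicit, here is a small
sketch in Python. It is not part of my Nutch setup; the function name
cycle_delta and the dictionary keys are only illustrative, and the
figures are the ones quoted above.

```python
def cycle_delta(before, after):
    """Return the per-key change between two snapshots of crawl counts."""
    return {key: after[key] - before[key] for key in before}


# Counts taken from the database after the 2nd and 3rd cycles
# (key names are hypothetical labels, not Nutch field names).
cycle2 = {"status_2": 22220, "total_links": 75882}
cycle3 = {"status_2": 22209, "total_links": 78443}

delta = cycle_delta(cycle2, cycle3)
print(delta)  # {'status_2': -11, 'total_links': 2561}
```

So 11 documents dropped out of status 2 while 2561 new links were added
in the same cycle.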

Please let me know if any more details are needed for a better
understanding of the problem. This is like black-box testing for me,
and I am unable to come to a conclusion.

Please reply soon. Thanks in advance.

Shubham Gupta