Posted to user@nutch.apache.org by "shubham.gupta" <sh...@orkash.com> on 2017/05/16 12:09:54 UTC
No. of documents decreasing in 2nd fetch | Nutch 2.3.1 + hadoop 2.7.1 + mongodb
Hey
I have a batch of 5000 seed URLs. I am trying to crawl these URLs using
the Apache Nutch job built by running the command "ant clean runtime".
In the first 2 cycles of the Nutch workflow, i.e.
inject->generate->fetch->parse->updatedb, it works fine and is able to
fetch around 20,000 URLs. But after the 2nd cycle, whenever the workflow
is executed again, the no. of documents with status 2 in the database
starts to decrease.
For example: after the 2nd cycle, the no. of documents with status 2 was
22220 and the total number of links present after updatedb was 75882.
After the 3rd cycle, the documents with status 2 decreased to 22209
while the total no. of links increased to 78443. I checked the logs and
the job does not report any error, so I am unable to debug this. Are
there some changes that need to be made in the Nutch configuration?
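To make the change between the two cycles explicit, here is a minimal
Python sketch that only does the arithmetic on the counts quoted above.
The dictionary keys and the helper name diff_counts are made up for
illustration and are not part of Nutch.

```python
# Counts as reported in this post. "status_2" is the number of documents
# with status 2; "total" is the total number of links after updatedb.
# Both key names are hypothetical labels chosen for this sketch.
cycle_2 = {"status_2": 22220, "total": 75882}
cycle_3 = {"status_2": 22209, "total": 78443}


def diff_counts(before, after):
    """Return (after - before) for every key present in either snapshot."""
    keys = sorted(set(before) | set(after))
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


delta = diff_counts(cycle_2, cycle_3)
for name, change in delta.items():
    print(f"{name}: {change:+d}")
# prints:
# status_2: -11
# total: +2561
```

So 11 documents left status 2 while 2561 new links were added. If the
counts for every status value were recorded after each cycle, the same
helper would show which status those 11 documents moved to.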
Please let me know if any more details are needed for a better
understanding of the problem. This is like black-box testing, where I
am unable to come to a conclusion.
Please reply soon. Thanks in advance.
Shubham Gupta