Posted to user@nutch.apache.org by Deepa Jayaveer <de...@tcs.com> on 2014/06/24 14:29:45 UTC

Regarding crawled pages with status=2

Hi,
Our requirement is that Nutch should not recrawl pages
that have already been crawled.
That is, pages with status '2' in the webpage table should not
be crawled again, and their outlinks should not be added
either.

Could you please let me know whether this is possible by changing
configuration parameters in nutch-site.xml?

Thanks and Regards
Deepa