Posted to user@nutch.apache.org by Deepa Jayaveer <de...@tcs.com> on 2014/06/24 14:29:45 UTC
reg crawled pages with status=2
Hi,
Our requirement is that Nutch should not recrawl pages
that have already been crawled.
That is, web pages with status '2' in the webpage table should not be
crawled again, and their outlinks should not be added
either.
Can you please let me know whether this is possible by changing some
configuration parameters in nutch-site.xml?
Thanks and Regards
Deepa