Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2012/12/26 06:10:26 UTC

Nutch approach for DeadLinks

Hi  All,

     How does Nutch handle dead links? For example, say a blog site is
crawled today and all of its posts (documents) are indexed to Solr.
Tomorrow one of the posts is deleted, so the URL indexed yesterday no
longer works. In such cases, how do I update the Solr index so that this
particular post no longer appears in search results? Recrawling the same
site did not delete the record from Solr. How should such cases be
handled? I am using the Nutch 1.5.1 binary.

Thanks,
David

RE: Nutch approach for DeadLinks

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - Nutch 1.5 has a -deleteGone switch for the SolrIndexer job. This will delete permanent redirects and 404s that have been discovered during the crawl. 1.6 also has a -deleteRobotsNoIndex switch that will delete pages carrying a robots meta tag with a noindex value.
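
For reference, a rough sketch of what the invocation might look like with the 1.5.x binary distribution. The Solr URL and the crawl/ paths below are placeholders, not taken from this thread, and the exact argument order can differ between releases, so run bin/nutch solrindex with no arguments to see the usage line for your version.

  # Index the crawled segments and, in the same job, delete gone pages
  # (404s) and permanent redirects from Solr (Nutch 1.5.x):
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
      -linkdb crawl/linkdb crawl/segments/* -deleteGone

  # Nutch 1.6 additionally accepts -deleteRobotsNoIndex to drop pages
  # marked with a robots meta tag containing "noindex":
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
      -linkdb crawl/linkdb crawl/segments/* -deleteGone -deleteRobotsNoIndex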
 
 