Posted to user@nutch.apache.org by Arthur Yarwood <ar...@fubaby.com> on 2016/03/05 23:33:10 UTC

Best tactic: Sites reporting a redirect instead of 404 gone.

A number of the sites I'm crawling and indexing have fairly transient
content (a lifespan of a few weeks). When that content expires, they
report a 301 permanent redirect rather than a 404. The redirect just
goes to a generic "content no longer here" page, which is more helpful
to normal web users. That is not ideal for crawling, and it is not
within my control.

What tactics and strategies can help mitigate this scenario?
In particular:
1) Removing these URLs from the crawl DB (as they would be if they were
404s and db.update.purge.404 = true).
2) Removing these from the Solr index I'm indexing into.

I'm leaning towards writing an additional maintenance script that
queries the crawldb for URLs with db_redir_perm status on given hosts
and removes them from Solr. I just fear it may be overzealous in
removing content from the index in cases of a legitimate redirect...
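
A minimal sketch of such a maintenance script, under several
assumptions: that the crawldb has been dumped to text (for example with
bin/nutch readdb <crawldb> -dump <dir>), that each record in the dump
starts with "<url><TAB>Version: N" followed by a "Status: N (name)"
line, and that the Solr document id is the page URL. The function names
are hypothetical and not part of Nutch. Restricting the match to an
explicit set of hosts is one way to limit the risk of deleting
legitimate redirects elsewhere.

```python
import json
import re
from urllib.parse import urlsplit

# Assumed layout of one record in a text dump of the crawldb:
#   <url>\tVersion: 7
#   Status: 5 (db_redir_perm)
#   ...further fields...
RECORD_START = re.compile(r"^(\S+)\tVersion: \d+")
STATUS_LINE = re.compile(r"^Status: \d+ \((\w+)\)")


def redirected_urls(dump_lines, hosts, status="db_redir_perm"):
    """Yield URLs on the given hosts whose crawldb status matches `status`."""
    current_url = None
    for line in dump_lines:
        start = RECORD_START.match(line)
        if start:
            # First line of a new record: remember its URL.
            current_url = start.group(1)
            continue
        found = STATUS_LINE.match(line)
        if found and current_url is not None:
            on_listed_host = urlsplit(current_url).hostname in hosts
            if found.group(1) == status and on_listed_host:
                yield current_url
            # Status seen; ignore the rest of this record.
            current_url = None


def solr_delete_payload(urls):
    """Build a JSON delete-by-id body (assumes the Solr id is the URL)."""
    return json.dumps({"delete": list(urls)})
```

The resulting payload could be reviewed by hand and then POSTed to the
Solr core's update handler. Printing the URL list first, rather than
deleting immediately, gives a chance to spot legitimate redirects
before they are removed from the index.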

Thanks!

-- 
Arthur Yarwood