Posted to user@nutch.apache.org by dspathis <ds...@gmail.com> on 2012/03/27 21:26:12 UTC
Re-indexing temporarily unavailable page
Hi,
I'm having trouble with the following use case:
I use Nutch to crawl a web site and index the pages. When a page is
temporarily unavailable (404 - Not Found), I would like the page removed
from the index; when it comes back later, I would like it indexed again.
I can't get the page to be re-indexed. The reason appears to be that when
the page becomes available again, the fetch succeeds but the result is "304
- Not Modified" (this is what the server returns, since the page was not
modified in the meantime). I guess this is because the Nutch Fetcher is
using the If-Modified-Since header.
If Nutch did *not* send the If-Modified-Since header when a page's last
status was db_gone (or if this were configurable), that would be one
solution. Am I missing a simpler solution?
Thanks.
--
View this message in context: http://lucene.472066.n3.nabble.com/Re-indexing-temporarily-unavailable-page-tp3862504p3862504.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Re-indexing temporarily unavailable page
Posted by remi tassing <ta...@gmail.com>.
nice!
On Wed, Mar 28, 2012 at 10:52 PM, dspathis <ds...@gmail.com> wrote:
> I forgot to mention I'm using Nutch 1.4.
>
> For those interested, I solved my issue by modifying the protocol-http
> plugin, specifically the HttpResponse class.
>
> In the HttpResponse constructor, I changed
>
>     if (datum.getModifiedTime() > 0) {
>       reqStr.append("If-Modified-Since: " +
>           HttpDateFormat.toString(datum.getModifiedTime()));
>       reqStr.append("\r\n");
>     }
>
> to
>
>     if (datum.getModifiedTime() > 0 && datum.getStatus() !=
>         CrawlDatum.STATUS_DB_GONE) {
>       reqStr.append("If-Modified-Since: " +
>           HttpDateFormat.toString(datum.getModifiedTime()));
>       reqStr.append("\r\n");
>     }
>
> This way, the HTTP GET request is issued without the If-Modified-Since
> header if the page's last fetch status was db_gone.
Re: Re-indexing temporarily unavailable page
Posted by dspathis <ds...@gmail.com>.
I forgot to mention I'm using Nutch 1.4.
For those interested, I solved my issue by modifying the protocol-http
plugin, specifically the HttpResponse class.
In the HttpResponse constructor, I changed
    if (datum.getModifiedTime() > 0) {
      reqStr.append("If-Modified-Since: " +
          HttpDateFormat.toString(datum.getModifiedTime()));
      reqStr.append("\r\n");
    }
to
    if (datum.getModifiedTime() > 0 && datum.getStatus() !=
        CrawlDatum.STATUS_DB_GONE) {
      reqStr.append("If-Modified-Since: " +
          HttpDateFormat.toString(datum.getModifiedTime()));
      reqStr.append("\r\n");
    }
This way, the HTTP GET request is issued without the If-Modified-Since
header if the page's last fetch status was db_gone.
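
To see the effect of the change in isolation, here is a minimal, self-contained sketch. It is not the Nutch source: the Status enum, httpDate() and buildRequest() are hypothetical stand-ins for CrawlDatum's status, HttpDateFormat.toString() and the request-building code in HttpResponse. It only illustrates the condition described above.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class ConditionalGetSketch {

    // Hypothetical stand-in for the CrawlDatum status values (not the real constants).
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE }

    // HTTP date format with a two-digit day, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Hypothetical stand-in for HttpDateFormat.toString(long).
    static String httpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    // Builds a bare GET request. The conditional header is added only when a
    // modification time is known AND the page was not last recorded as gone.
    static String buildRequest(String host, String path, long modifiedTime, Status lastStatus) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");
        if (modifiedTime > 0 && lastStatus != Status.DB_GONE) {
            reqStr.append("If-Modified-Since: ").append(httpDate(modifiedTime));
            reqStr.append("\r\n");
        }
        reqStr.append("\r\n"); // blank line terminates the header section
        return reqStr.toString();
    }

    public static void main(String[] args) {
        long modified = 1332883572000L; // arbitrary example timestamp
        // Previously fetched page: conditional GET, server may answer 304.
        System.out.print(buildRequest("example.com", "/page.html", modified, Status.DB_FETCHED));
        // Page last seen as gone: unconditional GET, server must send the full page.
        System.out.print(buildRequest("example.com", "/page.html", modified, Status.DB_GONE));
    }
}
```

With the header omitted for db_gone pages, the server has no basis for answering 304, so the page content is returned and can be indexed again.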