Posted to user@nutch.apache.org by dspathis <ds...@gmail.com> on 2012/03/27 21:26:12 UTC

Re-indexing temporarily unavailable page

Hi,

I'm having trouble with the following use case:

I use Nutch to crawl a web site and index the pages. When a page is
temporarily unavailable (404 - Not Found), I would like the page removed
from the index; when it comes back later, I would like it indexed again.

I can't get the page to be re-indexed. The reason appears to be that when
the page becomes available again, the fetch succeeds but the result is "304
- Not Modified" (this is what the server returns, since the page was not
modified in the meantime). I guess this is because the Nutch Fetcher is
using the If-Modified-Since header.

If Nutch did *not* use the If-Modified-Since header when a page's last
status was db_gone (or perhaps this could be configurable) that would be one
solution. Am I missing a simpler solution?

Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Re-indexing-temporarily-unavailable-page-tp3862504p3862504.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Re-indexing temporarily unavailable page

Posted by remi tassing <ta...@gmail.com>.
nice!

On Wed, Mar 28, 2012 at 10:52 PM, dspathis <ds...@gmail.com> wrote:

> I forgot to mention I'm using Nutch 1.4.
>
> For those interested, I solved my issue by modifying the protocol-http
> plugin, specifically the HttpResponse class.
>
> In the HttpResponse constructor, I changed
>
> if (datum.getModifiedTime() > 0) {
>   reqStr.append("If-Modified-Since: "
>       + HttpDateFormat.toString(datum.getModifiedTime()));
>   reqStr.append("\r\n");
> }
>
> to
>
> if (datum.getModifiedTime() > 0
>     && datum.getStatus() != CrawlDatum.STATUS_DB_GONE) {
>   reqStr.append("If-Modified-Since: "
>       + HttpDateFormat.toString(datum.getModifiedTime()));
>   reqStr.append("\r\n");
> }
>
> This way, the HTTP GET request is issued without the If-Modified-Since
> header if the page's last fetch status was db_gone.

Re: Re-indexing temporarily unavailable page

Posted by dspathis <ds...@gmail.com>.
I forgot to mention I'm using Nutch 1.4.

For those interested, I solved my issue by modifying the protocol-http
plugin, specifically the HttpResponse class.

In the HttpResponse constructor, I changed

if (datum.getModifiedTime() > 0) {
  reqStr.append("If-Modified-Since: "
      + HttpDateFormat.toString(datum.getModifiedTime()));
  reqStr.append("\r\n");
}

to

if (datum.getModifiedTime() > 0
    && datum.getStatus() != CrawlDatum.STATUS_DB_GONE) {
  reqStr.append("If-Modified-Since: "
      + HttpDateFormat.toString(datum.getModifiedTime()));
  reqStr.append("\r\n");
}

This way, the HTTP GET request is issued without the If-Modified-Since
header if the page's last fetch status was db_gone.
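
For anyone who wants to try the idea outside of Nutch, here is a minimal
standalone sketch of the same check. It is not the plugin code: the class
name, the LastStatus enum, and the buildRequest helper are made up for
illustration (the enum stands in for the CrawlDatum status constants), and
the date is formatted with java.time rather than Nutch's HttpDateFormat.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class ConditionalGetSketch {

    // Hypothetical stand-in for the CrawlDatum status constants.
    enum LastStatus { FETCHED, GONE }

    // HTTP-date layout, e.g. "Tue, 27 Mar 2012 21:26:12 GMT".
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    // Builds a bare GET request. The If-Modified-Since header is added only
    // when a modified time is known AND the last status was not GONE, so a
    // page that came back is fetched in full instead of answered with 304.
    static String buildRequest(String host, String path,
                               long modifiedTimeMs, LastStatus lastStatus) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");
        if (modifiedTimeMs > 0 && lastStatus != LastStatus.GONE) {
            reqStr.append("If-Modified-Since: ")
                  .append(HTTP_DATE.format(Instant.ofEpochMilli(modifiedTimeMs)))
                  .append("\r\n");
        }
        reqStr.append("\r\n"); // blank line ends the header section
        return reqStr.toString();
    }

    public static void main(String[] args) {
        long modified = 1332883572000L; // some earlier fetch time, in ms
        System.out.print(buildRequest("example.com", "/page.html",
                                      modified, LastStatus.FETCHED));
        System.out.print(buildRequest("example.com", "/page.html",
                                      modified, LastStatus.GONE));
    }
}
```

The first request printed carries the If-Modified-Since header; the second,
for a page last seen as gone, does not.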

