Posted to user@nutch.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2013/11/22 19:35:47 UTC

Not reading page body if page not modified?

Hi,

Is Nutch 2.x capable of issuing a GET request, comparing the reported
Last-Modified date with the last modified date from the previous fetch of a
page and, if the page is deemed unmodified since the last fetch, avoid
fetching the rest of the page?

.... and thus save bandwidth (and maybe speed up fetching)?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

RE: Not reading page body if page not modified?

Posted by Markus Jelsma <ma...@openindex.io>.
Nutch 2.x is very similar to 1.x; lib-http and protocol-http(client) did not really change. This is not possible out of the box in Nutch 1.7; there are no switches for this behaviour. I don't think it is easy to do with protocol-httpclient, unless HttpClient is already capable of it, but you'd have to check the ancient javadocs to be sure. It is possible to patch protocol-http for this. The CrawlDatum is passed in, so you know the date and can stop reading bytes after the headers.
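
To make the patch idea concrete, here is a minimal sketch of just the decision step: given the Last-Modified header value and the modified time stored from the previous fetch, decide whether the body still needs to be read. The class and method names are hypothetical, this is not Nutch code, and it assumes the stored time is available as epoch milliseconds. It falls back to reading the body whenever the header is missing, unparseable, or dated in the future.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Nutch: decides whether a fetcher should
// go on to read the response body after it has seen the response headers.
class LastModifiedCheck {

    /**
     * @param lastModifiedHeader raw Last-Modified header value, or null if absent
     * @param prevModifiedMillis modified time recorded at the previous fetch
     *                           (epoch millis); 0 or less means "unknown"
     * @return true if the body should be read, false if it can be skipped
     */
    static boolean shouldReadBody(String lastModifiedHeader, long prevModifiedMillis) {
        // No header or no previous fetch: nothing to compare, so read the body.
        if (lastModifiedHeader == null || prevModifiedMillis <= 0) {
            return true;
        }
        try {
            long reported = ZonedDateTime
                    .parse(lastModifiedHeader.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant()
                    .toEpochMilli();
            // A date in the future is obviously bogus; do not trust it.
            if (reported > System.currentTimeMillis()) {
                return true;
            }
            // Read the body only if the server reports a newer modification.
            return reported > prevModifiedMillis;
        } catch (DateTimeParseException e) {
            // Malformed header value: fall back to reading the body.
            return true;
        }
    }
}
```

In a patched protocol plugin, a call like this would sit between parsing the headers and reading the content, with the connection closed early when it returns false.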

I don't think this is a good idea to implement. You won't really notice a faster fetcher unless you're processing many millions of pages. You also _cannot trust_ HTTP headers: you are guaranteed to run into sites with crazy HTTP headers and crazy values for Last-Modified. Nothing makes sense on the internet.

Most dynamic sites don't return that header anyway, so you'll have to compare digests regardless. You can then move to efficient fetching by using an adaptive fetch scheduler.
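
As a rough illustration of that combination, the sketch below compares content digests to detect a change and then adjusts the re-fetch interval: shorter when the page changed, longer when it did not, clamped to a minimum and maximum. This is a simplified stand-in, not Nutch's own scheduler code; the class name, method names, the choice of MD5, and the rate parameters are all assumptions made for the example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch, not Nutch's scheduler: digest-based change detection
// plus a simple adaptive re-fetch interval.
class AdaptiveIntervalSketch {

    // MD5 digest of the fetched content (MD5 is an assumption for this sketch).
    static byte[] digest(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // A page counts as changed when its digest differs from the stored one.
    static boolean hasChanged(byte[] previousDigest, byte[] currentDigest) {
        return !Arrays.equals(previousDigest, currentDigest);
    }

    /**
     * Shrinks the interval by decRate when the page changed and grows it by
     * incRate when it did not, then clamps the result to [min, max].
     */
    static double nextInterval(double interval, boolean changed,
                               double incRate, double decRate,
                               double min, double max) {
        double next = changed
                ? interval * (1.0 - decRate)
                : interval * (1.0 + incRate);
        return Math.max(min, Math.min(max, next));
    }
}
```

Pages that never change drift towards the maximum interval and are fetched rarely, which gives the bandwidth saving without relying on Last-Modified at all.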
 
 
-----Original message-----
> From:Otis Gospodnetic <ot...@gmail.com>
> Sent: Friday 22nd November 2013 19:36
> To: Nutch User List <nu...@lucene.apache.org>
> Subject: Not reading page body if page not modified?
> 
> Hi,
> 
> Is Nutch 2.x capable of issuing a GET request, comparing the reported
> Last-Modified date with the last modified date from the previous fetch of a
> page and, if the page is deemed unmodified since the last fetch, avoid
> fetching the rest of the page?
> 
> .... and thus save bandwidth (and maybe speed up fetching)?
> 
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>