Posted to dev@nutch.apache.org by Davide Cavalaglio <da...@desktopsrl.com> on 2010/10/27 12:28:15 UTC
If-Modified-Since header with Nutch
Hi,
I have a problem using the If-Modified-Since header with Nutch.
I want to crawl a web site every day, so I have set the property
db.fetch.interval.default accordingly in nutch-site.xml.
However, I want Nutch to fetch only the pages that have changed, using
the If-Modified-Since header.
I found some resources on the web for this task, but when I recrawl after
the fetch interval has elapsed, Nutch downloads all pages again. I use
Nutch 1.0 with protocol http. I don't use the Adaptive Scheduler. In
HttpResponse.java I added the code:
    // Prefer the stored modification time for the conditional request.
    if (datum.getModifiedTime() > 0) {
      String httpDate = HttpDateFormat.toString(datum.getModifiedTime());
      Http.LOG.debug("modified time: " + httpDate);
      reqStr.append("If-Modified-Since: " + httpDate);
      reqStr.append("\r\n");
    }
    // Otherwise fall back to the fetch time stored in the CrawlDatum.
    else if (datum.getFetchTime() > 0) {
      String httpDate = HttpDateFormat.toString(datum.getFetchTime());
      Http.LOG.debug("fetch time: " + httpDate);
      reqStr.append("If-Modified-Since: " + httpDate);
      reqStr.append("\r\n");
    }
    // Blank line that terminates the request headers.
    reqStr.append("\r\n");
I added this because there was a bug that prevented the use of
If-Modified-Since.
I also changed Fetcher.java so that the CrawlDb holds the correct
LastModified value.
I tried crawling other web sites to find out whether the problem is
that my web server does not support If-Modified-Since. But in
every test I always get response code 200, even when the Last-Modified
date of the web page is older than the LastModified value in the CrawlDb.
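To separate a server problem from a Nutch problem, the conditional request can be tried outside the crawler. The following is only a minimal sketch using plain JDK classes, not Nutch code: the class name ConditionalGetSketch and the methods httpDate, ifModifiedSinceHeader and serverHonorsConditionalGet are hypothetical names chosen for illustration. It assumes timestamps are epoch milliseconds and that the header value must be an HTTP-date in GMT.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class ConditionalGetSketch {

    // HTTP-date layout with a two-digit day and a literal GMT zone,
    // e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Formats epoch milliseconds as an HTTP-date string.
    static String httpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    // Mirrors the logic of the snippet above: use the modification time if
    // it is known, otherwise the fetch time, otherwise send no header.
    static String ifModifiedSinceHeader(long modifiedTime, long fetchTime) {
        long since = modifiedTime > 0 ? modifiedTime : fetchTime;
        if (since <= 0) {
            return "";
        }
        return "If-Modified-Since: " + httpDate(since) + "\r\n";
    }

    // Sends a conditional GET and reports whether the server answered
    // 304 Not Modified. Requires network access.
    static boolean serverHonorsConditionalGet(String url, long sinceMillis)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setIfModifiedSince(sinceMillis);
            return conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Show the header that would be sent for a sample timestamp.
        System.out.print(ifModifiedSinceHeader(784111777000L, 0L));
        // Optionally probe a URL given on the command line, using "now"
        // as the reference time.
        if (args.length == 1) {
            boolean notModified =
                    serverHonorsConditionalGet(args[0], System.currentTimeMillis());
            System.out.println("304 returned: " + notModified);
        }
    }
}
```

If the probe reports 304 for a page that has not changed, the server supports conditional requests, and the remaining suspect is the timestamp value or the header the crawler actually sends.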
Can anyone tell me how to use If-Modified-Since correctly?
Thanks,
Cavalaglio Davide
Re: If-Modified-Since header with Nutch
Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Hi,
Did you solve the problem yourself?
I'm running into the same issue.
Maybe someone else can help here?
Regards
Hannes