Posted to dev@nutch.apache.org by Davide Cavalaglio <da...@desktopsrl.com> on 2010/10/27 12:28:15 UTC

If-Modified-Since header with Nutch

Hi,
I have a problem with the If-Modified-Since option in Nutch.
I want to crawl a web site every day, so I have set the
db.fetch.interval.default property in nutch-site.xml accordingly.
However, I want Nutch to fetch only the pages that have changed, using the
If-Modified-Since header.

I found some resources on the web describing how to do this, but when I
recrawl after the fetch interval has elapsed, Nutch downloads all pages
again. I use Nutch 1.0 with the http protocol, and I don't use the adaptive
scheduler. In HttpResponse.java I added the following code:
// Prefer the stored modification time; otherwise fall back to the
// datum's fetch time.
if (datum.getModifiedTime() > 0) {
  String httpDate = HttpDateFormat.toString(datum.getModifiedTime());
  Http.LOG.debug("modified time: " + httpDate);
  reqStr.append("If-Modified-Since: " + httpDate);
  reqStr.append("\r\n");
} else if (datum.getFetchTime() > 0) {
  String httpDate = HttpDateFormat.toString(datum.getFetchTime());
  Http.LOG.debug("fetch time: " + httpDate);
  reqStr.append("If-Modified-Since: " + httpDate);
  reqStr.append("\r\n");
}

// Empty line that terminates the request headers.
reqStr.append("\r\n");

I added this because a bug prevented If-Modified-Since from being used.
I also changed Fetcher.java so that the CrawlDb holds the correct
LastModified value.
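
The header-building step above can be reproduced outside Nutch to check
exactly what is sent. This is only a minimal sketch, not Nutch code: the
names IfModifiedSinceHeader, httpDate and headerLine are hypothetical, and
java.time stands in for Nutch's HttpDateFormat on the assumption that the
latter produces the usual HTTP date form (e.g. "Sun, 06 Nov 1994 08:49:37 GMT").

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: builds an If-Modified-Since line.
final class IfModifiedSinceHeader {

    // Fixed-length HTTP date: two-digit day, English names, always GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Formats milliseconds since the epoch as an HTTP date string.
    static String httpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    // Mirrors the fallback in the snippet above: use the modification time
    // if it is set, else the fetch time, else send no header at all.
    static String headerLine(long modifiedTime, long fetchTime) {
        long time = modifiedTime > 0 ? modifiedTime : fetchTime;
        if (time <= 0) {
            return "";
        }
        return "If-Modified-Since: " + httpDate(time) + "\r\n";
    }
}
```

Printing headerLine(...) with the values stored for a given URL shows
whether the date being sent is the one you expect.
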
I also tried crawling other web sites, to find out whether the problem is
that my web server does not support If-Modified-Since. But in every test I
always get response code 200, even when the page's last-modified time is
older than the LastModified value in the CrawlDb.
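
One way to separate a server problem from a Nutch problem is to send a
conditional GET without Nutch at all. The sketch below is an assumption-laden
illustration, not a fix: ConditionalGetProbe, probe and describe are
hypothetical names, and it relies only on the standard
java.net.HttpURLConnection API. It sends the current time as the
If-Modified-Since date, so a server that supports conditional requests for
that page would normally answer 304. Pages served without a Last-Modified
date (many dynamically generated pages) may answer 200 regardless.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical standalone probe, independent of Nutch.
final class ConditionalGetProbe {

    // Turns a status code into a short human-readable verdict.
    static String describe(int status) {
        switch (status) {
            case 304:
                return "304 Not Modified: the server honoured If-Modified-Since";
            case 200:
                return "200 OK: the server ignored the header or considers the page changed";
            default:
                return status + ": inconclusive for this test";
        }
    }

    // Sends a GET with If-Modified-Since set and returns the status code.
    static int probe(String url, long sinceEpochMillis) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setIfModifiedSince(sinceEpochMillis);
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ConditionalGetProbe <url>");
            return;
        }
        // "Now" is later than any past modification, so 304 is expected
        // whenever the server supports conditional GET for this page.
        System.out.println(describe(probe(args[0], System.currentTimeMillis())));
    }
}
```

If this probe gets 304 for a URL where the crawl gets 200, the request built
inside the crawler is the more likely culprit; if it also gets 200, the
server side is worth a closer look.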

Can anyone tell me how to use If-Modified-Since correctly?
Thanks,
Cavalaglio Davide

Re: If-Modified-Since header with Nutch

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Hi,

Did you solve the problem yourself?
I'm running into the same issue...

Maybe someone else could help here?

Regards

Hannes
