Posted to dev@nutch.apache.org by "Jeremy Calvert (JIRA)" <ji...@apache.org> on 2006/05/19 18:52:30 UTC

[jira] Updated: (NUTCH-270) Apply just the applicable portions of the patch to protocol.httpclient.Http.java

     [ http://issues.apache.org/jira/browse/NUTCH-270?page=all ]

Jeremy Calvert updated NUTCH-270:
---------------------------------


Hmm, it looks like the code in the patches attached to the parent issue is pretty dated. I looked at the latest protocol-httpclient (r407567 | ab | 2006-05-18 08:26:06 -0700 (Thu, 18 May 2006)) and wrote the following crude patch to support conditional GETs:

===================================================================
--- src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java      (revision 389177)
+++ src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java      (working copy)
@@ -32,6 +32,7 @@
 // Nutch imports
 import org.apache.nutch.crawl.CrawlDatum;
 import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.net.protocols.HttpDateFormat;
 import org.apache.nutch.net.protocols.Response;
 import org.apache.nutch.protocol.http.api.HttpBase;

@@ -69,6 +70,7 @@
     GetMethod get = new GetMethod(this.orig);
     get.setFollowRedirects(followRedirects);
     get.setRequestHeader("User-Agent", http.getUserAgent());
+    get.setRequestHeader("If-Modified-Since", HttpDateFormat.toString(datum.getModifiedTime()));
     HttpMethodParams params = get.getParams();
     // some servers cannot digest the new protocol
     params.setVersion(HttpVersion.HTTP_1_0);

===================================================================

One note: according to RFC 2616, the client is supposed to send the Last-Modified time it last received from the server in the If-Modified-Since header, but I couldn't find anywhere that CrawlDatum.setModifiedTime is actually called.
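
Given that the modified time may never be set, it could be safer to send the header only when a time has actually been recorded. Below is a minimal standalone sketch of that guard, not part of the patch. It assumes an unset modified time shows up as 0 or less, and the names ConditionalGetSketch and ifModifiedSince are hypothetical. It uses only the JDK rather than Nutch's HttpDateFormat, so it can be run on its own.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; not a Nutch class.
public class ConditionalGetSketch {

    // HTTP-date in the fixed-length RFC 1123 form, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Returns the If-Modified-Since header value, or empty when no modified
    // time has been recorded (assumption: "unset" is represented as <= 0).
    static Optional<String> ifModifiedSince(long modifiedTimeMillis) {
        if (modifiedTimeMillis <= 0L) {
            return Optional.empty();
        }
        return Optional.of(HTTP_DATE.format(Instant.ofEpochMilli(modifiedTimeMillis)));
    }

    public static void main(String[] args) {
        // No recorded time: the header would be omitted, giving a plain GET.
        System.out.println(ifModifiedSince(0L).orElse("(header omitted)"));
        // A recorded time: the header value to send with the request.
        System.out.println(ifModifiedSince(784111777000L).orElse("(header omitted)"));
    }
}
```

In the patch above, the same idea would amount to wrapping the setRequestHeader("If-Modified-Since", ...) call in a check on datum.getModifiedTime(), so a never-fetched URL gets an unconditional GET.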


> Apply just the applicable portions of the patch to protocol.httpclient.Http.java
> --------------------------------------------------------------------------------
>
>          Key: NUTCH-270
>          URL: http://issues.apache.org/jira/browse/NUTCH-270
>      Project: Nutch
>         Type: Sub-task

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Jeremy Calvert

>
> This seems to be two issues in one.  Adaptive scheduling AND content change detection.
> I don't see any reason not to apply the patch to allow content change detection.  That is, the parts of the patch to support changing the signature HttpResponse(URL url, long lastModified).  It'd be especially useful for those of us who refetch feeds fairly frequently.
