Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/10/07 17:10:00 UTC

[jira] [Commented] (NUTCH-2642) MoreIndexingFilter parses ISO 8601 UTC dates in local time zone

    [ https://issues.apache.org/jira/browse/NUTCH-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641140#comment-16641140 ] 

ASF GitHub Bot commented on NUTCH-2642:
---------------------------------------

sebastian-nagel closed pull request #385: NUTCH-2642 MoreIndexingFilter parses ISO 8601 UTC dates in local time zone
URL: https://github.com/apache/nutch/pull/385

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java b/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
index 3de951ab7..c16d23361 100644
--- a/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
+++ b/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
@@ -104,8 +104,10 @@ private NutchDocument addTime(NutchDocument doc, ParseData data, String url,
     String lastModified = data.getMeta(Metadata.LAST_MODIFIED);
     if (lastModified != null) { // try parse last-modified
       time = getTime(lastModified, url); // use as time
-                                         // store as string
-      doc.add("lastModified", new Date(time));
+                                         // store as Date
+      if (time > -1) {
+        doc.add("lastModified", new Date(time));
+      }
     }
 
     if (time == -1) { // if no last-modified specified in HTTP header
@@ -139,7 +141,7 @@ private long getTime(String date, String url) {
             "MMM dd yyyy HH:mm:ss. zzz", "MMM dd yyyy HH:mm:ss zzz",
             "dd.MM.yyyy HH:mm:ss zzz", "dd MM yyyy HH:mm:ss zzz",
             "dd.MM.yyyy; HH:mm:ss", "dd.MM.yyyy HH:mm:ss", "dd.MM.yyyy zzz",
-            "yyyy-MM-dd'T'HH:mm:ss'Z'" });
+            "yyyy-MM-dd'T'HH:mm:ssXXX" });
         time = parsedDate.getTime();
         // if (LOG.isWarnEnabled()) {
         // LOG.warn(url + ": parsed date: " + date +" to:"+time);
diff --git a/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java b/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
index f918dde97..5a22cf6de 100644
--- a/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
+++ b/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
@@ -16,6 +16,9 @@
  */
 package org.apache.nutch.indexer.more;
 
+import java.time.format.DateTimeFormatter;
+import java.util.Date;
+
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.crawl.CrawlDatum;
@@ -120,4 +123,42 @@ private void assertContentType(Configuration conf, String source,
     Assert.assertEquals("mime type not detected", expected,
         doc.getFieldValue("type"));
   }
+
+  @Test
+  public void testDates() throws IndexingException {
+    Configuration conf = NutchConfiguration.create();
+
+    Metadata metadata = new Metadata();
+    MoreIndexingFilter filter = new MoreIndexingFilter();
+    filter.setConf(conf);
+
+    Text url = new Text("http://www.example.com/");
+    ParseImpl parseImpl = new ParseImpl("text",
+        new ParseData(new ParseStatus(), "title", new Outlink[0], metadata));
+    CrawlDatum fetchDatum = new CrawlDatum();
+    NutchDocument doc = new NutchDocument();
+
+    long dateEpocheSeconds = 1537898340; // 2018-09-25T17:59:00+0000
+    fetchDatum.setModifiedTime(dateEpocheSeconds * 1000);
+    // fetch time 30 days later
+    fetchDatum.setFetchTime((dateEpocheSeconds + 30 * 24 * 60 * 60) * 1000);
+    // NOTE: the datum.getLastModified() returns the fetch time
+    // of the latest fetch of changed content (i.e., not a "HTTP 304
+    // not-modified" or a re-fetch resulting in the same signature)
+
+    doc = filter.filter(doc, parseImpl, url, fetchDatum, new Inlinks());
+
+    Assert.assertEquals("last fetch date not extracted",
+        new Date(dateEpocheSeconds * 1000), doc.getFieldValue("date"));
+
+    // set last-modified time (7 days before fetch time)
+    Date lastModifiedDate = new Date(
+        (dateEpocheSeconds - 7 * 24 * 60 * 60) * 1000);
+    String lastModifiedDateStr = DateTimeFormatter.ISO_INSTANT.format(lastModifiedDate.toInstant());
+    parseImpl.getData().getParseMeta().set(Metadata.LAST_MODIFIED, lastModifiedDateStr);
+    doc = filter.filter(doc, parseImpl, url, fetchDatum, new Inlinks());
+
+    Assert.assertEquals("last-modified date not extracted", lastModifiedDate,
+        doc.getFieldValue("lastModified"));
+  }
 }
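
The `time > -1` guard added to addTime in the diff above only matters if
getTime reports a failed parse as -1. The existing `time == -1` check
suggests this, but the diff does not show that part of getTime, so treat it
as an assumption. Under that assumption, a minimal Java sketch of the value
an unguarded `new Date(time)` would put into the "lastModified" field (the
class and helper names are hypothetical, not part of Nutch):

```java
import java.util.Date;

final class MinusOneDateDemo {

  // Hypothetical helper: ISO 8601 rendering of the Date built from a raw
  // epoch-milliseconds value, as "new Date(time)" would build it.
  static String asInstantString(long millis) {
    return new Date(millis).toInstant().toString();
  }

  public static void main(String[] args) {
    // -1 is assumed here to be the "could not parse" sentinel.
    System.out.println(asInstantString(-1L)); // 1969-12-31T23:59:59.999Z
  }
}
```

Without the guard, an unparsable Last-Modified header would therefore be
indexed as a date one millisecond before the Unix epoch instead of being
omitted.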


> MoreIndexingFilter parses ISO 8601 UTC dates in local time zone
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2642
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2642
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, plugin
>    Affects Versions: 2.3.1, 1.14, 1.15
>            Reporter: John Lacey
>            Priority: Minor
>             Fix For: 2.4, 1.16
>
>
> The ISO 8601 pattern in MoreIndexingFilter.getTime is "yyyy-MM-dd'T'HH:mm:ss'Z'". Note the literal Z.
> [https://github.com/apache/nutch/blob/b834b81/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java#L142]
> Apache commons-lang's DateUtils uses the local time zone by default when parsing, and can't tell that a string matching this pattern is specifying an offset because the pattern doesn't have an offset, just a literal "Z":
> [https://github.com/apache/commons-lang/blob/b610707/src/main/java/org/apache/commons/lang3/time/DateUtils.java#L370]
> So, when parsing a date string such as "2018-09-04T12:34:56Z", the time is returned as a local time:
> DateUtils.parseDate("2018-09-04T12:34:56Z", new String[] { "yyyy-MM-dd'T'HH:mm:ss'Z'" })
> => Tue Sep 04 12:34:56 PDT 2018 (1536089696000)
> I think a reasonable fix would be to specify an offset pattern instead of a literal "Z": "yyyy-MM-dd'T'HH:mm:ssXXX". That would also allow arbitrary offsets, as well as "Z":
> DateUtils.parseDate("2018-09-04T12:34:56Z", new String[] { "yyyy-MM-dd'T'HH:mm:ssXXX" })
> => Tue Sep 04 05:34:56 PDT 2018 (1536064496000)
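
The difference between the two patterns can be reproduced without Nutch or
commons-lang. The sketch below uses the JDK's java.text.SimpleDateFormat in
place of DateUtils (a substitution made here to keep the example
dependency-free) and pins the formatter to America/Los_Angeles so the result
does not depend on the machine's default zone. The class name, the helper
parseMillis, and the choice to return -1 on failure are illustrative
assumptions, not code from the issue or the patch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class IsoOffsetDemo {

  // Hypothetical helper: parse "date" with "pattern", using "zone" as the
  // formatter's default time zone. Returns epoch milliseconds, or -1 if the
  // string does not match the pattern.
  static long parseMillis(String date, String pattern, String zone) {
    SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
    format.setLenient(false);
    format.setTimeZone(TimeZone.getTimeZone(zone));
    try {
      return format.parse(date).getTime();
    } catch (ParseException e) {
      return -1L;
    }
  }

  public static void main(String[] args) {
    String date = "2018-09-04T12:34:56Z";
    String zone = "America/Los_Angeles";

    // Literal 'Z': the letter is only matched as text, so the time is read
    // as 12:34:56 in the formatter's zone (PDT on that date).
    System.out.println(
        parseMillis(date, "yyyy-MM-dd'T'HH:mm:ss'Z'", zone)); // 1536089696000

    // XXX: "Z" is parsed as the UTC designator, so the time is 12:34:56 UTC.
    System.out.println(
        parseMillis(date, "yyyy-MM-dd'T'HH:mm:ssXXX", zone)); // 1536064496000

    // XXX also accepts explicit offsets such as +02:00.
    System.out.println(parseMillis("2018-09-04T12:34:56+02:00",
        "yyyy-MM-dd'T'HH:mm:ssXXX", zone)); // 1536057296000
  }
}
```

The two results for the "Z" string differ by 25200000 ms, which is the
seven-hour PDT offset shown in the issue description.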
