Posted to dev@nutch.apache.org by "John Lacey (JIRA)" <ji...@apache.org> on 2018/09/05 08:26:00 UTC

[jira] [Created] (NUTCH-2642) MoreIndexingFilter parses ISO 8601 UTC dates in local time zone

John Lacey created NUTCH-2642:
---------------------------------

             Summary: MoreIndexingFilter parses ISO 8601 UTC dates in local time zone
                 Key: NUTCH-2642
                 URL: https://issues.apache.org/jira/browse/NUTCH-2642
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.15, 1.14
            Reporter: John Lacey


The ISO 8601 pattern in MoreIndexingFilter.getTime is "yyyy-MM-dd'T'HH:mm:ss'Z'". Note the literal Z.

[https://github.com/apache/nutch/blob/b834b81/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java#L142]

Apache commons-lang's DateUtils parses in the local time zone by default. Because the pattern contains only a literal "Z" rather than an offset field, the parser cannot tell that a matching string specifies UTC:

[https://github.com/apache/commons-lang/blob/b610707/src/main/java/org/apache/commons/lang3/time/DateUtils.java#L370]

So, when parsing a date string such as "2018-09-04T12:34:56Z", the time is interpreted as local time instead of UTC (here, on a machine in US Pacific time):

DateUtils.parseDate("2018-09-04T12:34:56Z", new String[] { "yyyy-MM-dd'T'HH:mm:ss'Z'" })
=> Tue Sep 04 12:34:56 PDT 2018 (1536089696000)
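
To reproduce this independently of the machine's default zone, here is a minimal sketch. It substitutes the JDK's SimpleDateFormat for DateUtils so it runs without dependencies, on the assumption that both treat a quoted 'Z' as plain text. The class and method names are hypothetical, and the "local" zone is passed in explicitly.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical demo class; not part of Nutch or commons-lang.
final class LiteralZDemo {
    // Parses with the literal-'Z' pattern, treating localZone as the machine's zone.
    static long parseMillis(String text, TimeZone localZone) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        format.setTimeZone(localZone);
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        String text = "2018-09-04T12:34:56Z";
        // Same string, different instants: the quoted 'Z' carries no offset.
        System.out.println(parseMillis(text, TimeZone.getTimeZone("America/Los_Angeles"))); // 1536089696000
        System.out.println(parseMillis(text, TimeZone.getTimeZone("UTC")));                 // 1536064496000
    }
}
```

The two results differ by exactly seven hours, the PDT offset, so the indexed value depends on where the crawler runs.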

I think a reasonable fix would be to specify an offset pattern instead of a literal "Z": "yyyy-MM-dd'T'HH:mm:ssXXX". That would also allow arbitrary offsets, as well as "Z":

DateUtils.parseDate("2018-09-04T12:34:56Z", new String[] { "yyyy-MM-dd'T'HH:mm:ssXXX" })
=> Tue Sep 04 05:34:56 PDT 2018 (1536064496000)
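
As a rough check of the proposed pattern, the sketch below parses both a "Z" suffix and an explicit offset. It again uses the JDK's SimpleDateFormat as a stand-in for DateUtils (an assumption made so the example has no dependencies). The class name, method name, and the "+02:00" input are hypothetical illustrations, not taken from Nutch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical demo class; not part of Nutch or commons-lang.
final class OffsetPatternDemo {
    // Parses with the ISO 8601 offset pattern; localZone is only a fallback default.
    static long parseMillis(String text, TimeZone localZone) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX");
        format.setTimeZone(localZone);
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        TimeZone pacific = TimeZone.getTimeZone("America/Los_Angeles");
        // "Z" is read as UTC even though the formatter's zone is Pacific.
        System.out.println(parseMillis("2018-09-04T12:34:56Z", pacific));      // 1536064496000
        // An explicit offset naming the same instant yields the same value.
        System.out.println(parseMillis("2018-09-04T14:34:56+02:00", pacific)); // 1536064496000
    }
}
```

With the offset taken from the string itself, the result no longer depends on the zone the parser is configured with.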



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)