Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/06/30 13:41:49 UTC

[jira] Created: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED
------------------------------------------------------------------------------

                 Key: TIKA-451
                 URL: https://issues.apache.org/jira/browse/TIKA-451
             Project: Tika
          Issue Type: Improvement
          Components: metadata, parser
    Affects Versions: 0.7
            Reporter: Nick Burch
            Priority: Minor


Currently, the PDF parser calls calendar.getTime().toString(), which means dates end up in your local timezone and are hard to parse.

The OpenDocument parsers output ISO 8601 format, which avoids both problems.

The POI OLE2-based parsers also output in date.toString() format, with the same timezone and parsing problems.

We should probably select one format and update all the parsers to output in it.
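
To make the timezone problem concrete, here is a minimal sketch (the class and method names are hypothetical, not Tika code) that renders one and the same instant with Date.toString() under two different JVM default time zones. The resulting strings differ, which is what a consumer sees when the same document is parsed on machines in different zones.

```java
import java.util.Date;
import java.util.TimeZone;

// Hypothetical demo class, not part of Tika.
final class LocalTimeZoneDemo {

    // Renders the date with Date.toString() as a JVM whose default zone is zoneId would.
    static String toStringIn(String zoneId, Date date) {
        TimeZone saved = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
            return date.toString();
        } finally {
            // Always restore the original default zone.
            TimeZone.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L); // a single, fixed instant
        // Same instant, two different strings: the output depends on the default zone.
        System.out.println(toStringIn("UTC", epoch));
        System.out.println(toStringIn("America/Los_Angeles", epoch));
    }
}
```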

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by Oleg Tikhonov <ol...@gmail.com>.
+1, bull's eye.

On Tue, Jul 6, 2010 at 8:41 PM, Chris A. Mattmann (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885614#action_12885614]
>
> Chris A. Mattmann commented on TIKA-451:
> ----------------------------------------
>
> +1 to Jukka's suggestion...
>


-- 
Best regards, Oleg.

[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885604#action_12885604 ] 

Jukka Zitting commented on TIKA-451:
------------------------------------

See page 11 of http://www.adobe.com/devnet/xmp/pdfs/XMPSpecificationPart2.pdf for the ISO 8601 subset used by XMP. I think that matches our needs pretty well.

One of my forward-looking ideas behind introducing the Property class was to use it for these kinds of type-safe value conversions. We could add Property.setDate(Metadata, Date) and Property.getDate(Metadata) methods that could also take advantage of the static value type information included in the Property constants. For example an integer property constant could throw an exception (or use some predefined conversion rule) when you attempt to get its value as a date. For added compile-time type-safety we could even add explicit DateProperty, IntegerProperty, etc. subclasses for specific kinds of metadata properties.
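
A rough sketch of the typed-subclass idea, under stated assumptions: DateProperty and IntegerProperty are the names floated above, but TypedProperty, the method signatures, the plain Map standing in for Metadata, and the exact ISO 8601 pattern are all hypothetical and not Tika's API. The point is only that the compiler, rather than a run-time check, stops you from reading an integer property as a date.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical base class: one value type per property, fixed at compile time.
abstract class TypedProperty<T> {
    final String name;

    TypedProperty(String name) {
        this.name = name;
    }

    abstract String format(T value);

    // Returns null when the stored string cannot be converted.
    abstract T parse(String raw);

    // A plain map stands in for the Metadata object in this sketch.
    void set(Map<String, String> metadata, T value) {
        metadata.put(name, format(value));
    }

    T get(Map<String, String> metadata) {
        String raw = metadata.get(name);
        return raw == null ? null : parse(raw);
    }
}

final class DateProperty extends TypedProperty<Date> {
    DateProperty(String name) {
        super(name);
    }

    // Assumed format: second precision, always UTC. Built per call because
    // SimpleDateFormat is not thread-safe.
    private static SimpleDateFormat iso8601() {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f;
    }

    @Override
    String format(Date value) {
        return iso8601().format(value);
    }

    @Override
    Date parse(String raw) {
        try {
            return iso8601().parse(raw);
        } catch (ParseException e) {
            return null;
        }
    }
}

final class IntegerProperty extends TypedProperty<Integer> {
    IntegerProperty(String name) {
        super(name);
    }

    @Override
    String format(Integer value) {
        return value.toString();
    }

    @Override
    Integer parse(String raw) {
        try {
            return Integer.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

With this shape, a call such as `new IntegerProperty("Page-Count").set(map, new Date())` does not compile, so no exception type is needed for that case.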



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884415#action_12884415 ] 

Nick Burch commented on TIKA-451:
---------------------------------

OK, makes sense to me

As we have several parsers which currently have a Date object (or a Calendar that can yield a Date), we probably want to put the Date -> ISO 8601 string conversion in one place to avoid duplication. I think adding lots of overloaded methods to the Metadata object might make things a little ugly (e.g. set and add, each with String and Property keys, possibly for both Date and Calendar...).

One option I see is a single overloaded set(Property,Date); since we shouldn't need to handle multiple Dates, we don't need an add. This would involve switching a couple of the Metadata keys from String to Property, though (but I don't think this should affect many users, if any).

The other option is to add static helper methods, probably on Metadata though they needn't be, something like "public static String formatDate(Date d)" and "public static String formatDate(Calendar c)", then keep the rest of the Metadata object as-is and require the parsers to use the helper to do the Date -> String conversion before storing the string.

Since we do have set(Property,int), I'd probably lean towards the former option. What does everyone else think?
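
For comparison, the second option (the static helpers) might look roughly like the sketch below. The two formatDate signatures are the ones quoted above; the holder class name MetadataDates, the choice of UTC, and the exact pattern string are assumptions made for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical holder class; the discussion suggests Metadata itself could host these.
final class MetadataDates {
    private MetadataDates() {
    }

    // Date -> ISO 8601 string, always in UTC so the JVM's default zone never leaks in.
    public static String formatDate(Date d) {
        // SimpleDateFormat is not thread-safe, so build a fresh one per call.
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(d);
    }

    // Calendar overload: only the instant is used; the Calendar's own zone is dropped.
    public static String formatDate(Calendar c) {
        return formatDate(c.getTime());
    }
}
```

A parser would then store the string itself, for example `metadata.set(Metadata.CREATION_DATE, MetadataDates.formatDate(date))`, which keeps Metadata unchanged but relies on every parser remembering to call the helper.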

Nick



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883947#action_12883947 ] 

Chris A. Mattmann commented on TIKA-451:
----------------------------------------

+1 for ISO 8601. It's cross-database, cross-platform and lexicographically sortable.
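
A small illustration of the sortability point, with a hypothetical class name: plain string sorting puts ISO 8601 timestamps in chronological order, provided they all use the same offset (UTC here) and the same precision. Date.toString() output does not have this property, since it starts with the day-of-week name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of Tika.
final class SortableDatesDemo {

    // Sorts the strings by natural String order; no date parsing is involved.
    static List<String> sorted(String... values) {
        List<String> copy = new ArrayList<String>(Arrays.asList(values));
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // The timestamps come out oldest first.
        System.out.println(sorted(
                "2010-07-06T20:41:00Z",
                "2009-12-31T23:59:59Z",
                "2010-06-30T13:41:49Z"));
    }
}
```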



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886016#action_12886016 ] 

Chris A. Mattmann commented on TIKA-451:
----------------------------------------

+1 to throwing a new PropertyTypeException, for now. Another option would be to call it PropertyValidationException, to signify that we may (later) include the ability to attach custom validators to a Metadata object, but that might be a bit too heavyweight!



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886193#action_12886193 ] 

Chris A. Mattmann commented on TIKA-451:
----------------------------------------

Agreed, +1 from me too here...



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905931#action_12905931 ] 

Nick Burch commented on TIKA-451:
---------------------------------

I've applied Staffan's patch as part of the larger changes in TIKA-482



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886116#action_12886116 ] 

Nick Burch commented on TIKA-451:
---------------------------------

Well, there are two validation steps. Firstly, for integers, we have a pair of asserts that check, when you call set(property,int), that the property is both simple and int-based. Those could certainly be replaced with a test plus throw PropertyTypeException. (We'll want the same in getDate(property) for non-date property definitions.)

Then there's the get when the string value is of the wrong type (e.g. it should be a date but isn't in the right format). That could be PropertyValidationException or similar. Or we could make them both the same exception for now?
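
The first of those two steps, replacing the asserts in the typed setter with a test and a throw, could be sketched as follows. PropertyTypeException is the name proposed in this thread; SketchProperty, SketchMetadata and the ValueType enum are hypothetical simplifications, and the "simple" (single-valued) check mentioned above is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Name taken from the discussion; the class body is an assumption.
class PropertyTypeException extends IllegalArgumentException {
    PropertyTypeException(String message) {
        super(message);
    }
}

// Hypothetical, simplified stand-in for a property definition.
final class SketchProperty {
    enum ValueType { INTEGER, DATE, TEXT }

    final String name;
    final ValueType type;

    SketchProperty(String name, ValueType type) {
        this.name = name;
        this.type = type;
    }
}

// Hypothetical, simplified stand-in for the Metadata object.
final class SketchMetadata {
    private final Map<String, String> values = new HashMap<String, String>();

    // Typed setter: an explicit test and throw instead of an assert, so the
    // check still runs when assertions are disabled.
    void set(SketchProperty property, int value) {
        if (property.type != SketchProperty.ValueType.INTEGER) {
            throw new PropertyTypeException(
                    property.name + " is " + property.type + ", not INTEGER");
        }
        values.put(property.name, Integer.toString(value));
    }

    String get(SketchProperty property) {
        return values.get(property.name);
    }
}
```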



[jira] Assigned: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch reassigned TIKA-451:
-------------------------------

    Assignee: Nick Burch



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884407#action_12884407 ] 

Chris A. Mattmann commented on TIKA-451:
----------------------------------------

Hi Nick,

We might impact users, but since we are pre-1.0 and there is no documented expectation about the format of returned metadata dates, I think that this change is acceptable.

ISO 8601 seems to be a good standard to comply with, and I think the time is ripe to do it...

Cheers,
Chris



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885614#action_12885614 ] 

Chris A. Mattmann commented on TIKA-451:
----------------------------------------

+1 to Jukka's suggestion...



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888590#action_12888590 ] 

Nick Burch commented on TIKA-451:
---------------------------------

I've made the suggested enhancement to make Metadata.CREATION_DATE and Metadata.LAST_MODIFIED Date properties, with appropriate setters, getters and handling of invalid values. Committed in r964235. Now I just need to update the parsers to make use of this.



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899303#action_12899303 ] 

Staffan Olsson commented on TIKA-451:
-------------------------------------

Had to make a fix for the mbox format: http://github.com/solsson/tika/commit/0b3eaa9f2dd927a14823fb519a168659ed4fa1c1



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884339#action_12884339 ] 

Nick Burch commented on TIKA-451:
---------------------------------

If we make the change, it could impact users, but I think not too much.

Currently, a number of different date formats crop up in the date field. This means that anyone who cares about the format already has to try multiple date patterns to parse it. So they shouldn't be affected if we standardise on a format that already exists.

The only people I can see being affected are those who only ever use one of the date.toString() parsers, and no others, and who assume that format for all dates. Hopefully that's a rare enough use case that we don't need to worry about it when making this change?
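
As an illustration of what "trying multiple date patterns" means for a consumer today, here is a hedged sketch. The class name and the two patterns are assumptions: the first is one ISO 8601 form in UTC, the second mirrors the layout java.util.Date.toString() produces. Real callers may well need more patterns than these.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical consumer-side code, not part of Tika.
final class MultiPatternDateParser {

    // Patterns tried in order: ISO 8601 in UTC, then the Date.toString() layout.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "EEE MMM dd HH:mm:ss zzz yyyy",
    };

    // Returns the first successful parse, or null if no pattern matches.
    static Date parse(String raw) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            // Used when the text carries no zone of its own (the ISO pattern).
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            try {
                return format.parse(raw);
            } catch (ParseException e) {
                // Not this pattern; fall through to the next one.
            }
        }
        return null;
    }
}
```

Once every parser emits a single format, a consumer like this collapses to one pattern, and the zone-abbreviation parsing in the second pattern (which is ambiguous for some abbreviations) goes away.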



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885671#action_12885671 ] 

Nick Burch commented on TIKA-451:
---------------------------------

All makes sense to me. I'll hopefully get around to making the changes to the Metadata object in the next couple of days and, once that's in, to rolling out the required changes to the parsers that don't currently output suitable dates.



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886119#action_12886119 ] 

Jukka Zitting commented on TIKA-451:
------------------------------------

I would only do property type checks in type-specific setters like setDate() or setInteger(). I'd allow the generic set() method with a string argument to always succeed. This avoids breaking the parsing of a document even if some of its metadata fields are malformed against our expectations.

Similarly I'd avoid throwing any exceptions from metadata getters. A malformed metadata value should probably be handled as if it was missing by the type-specific getters, and returned as-is by the generic get() method.
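
A minimal sketch of those lenient semantics, assuming a hypothetical LenientMetadata class with string keys and one fixed ISO 8601 pattern (neither is Tika's actual API): the generic setter accepts anything, the generic getter returns the raw string, and the date getter treats a malformed value as missing.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical, simplified stand-in for the Metadata object.
final class LenientMetadata {
    private final Map<String, String> values = new HashMap<String, String>();

    // Generic setter: always succeeds, even for values that are not valid dates.
    void set(String name, String value) {
        values.put(name, value);
    }

    // Generic getter: returns the stored string as-is.
    String get(String name) {
        return values.get(name);
    }

    // Type-specific getter: a malformed value is reported as if it were missing.
    Date getDate(String name) {
        String raw = values.get(name);
        if (raw == null) {
            return null;
        }
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(raw);
        } catch (ParseException e) {
            return null;
        }
    }
}
```

Under this design, setting a date key to "Last Thursday" and then calling getDate() yields null rather than an exception, while get() still returns "Last Thursday" for callers who want the raw value.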




[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886009#action_12886009 ] 

Nick Burch commented on TIKA-451:
---------------------------------

Anyone have a strong opinion on what to do in this case:
   metadata.set(Metadata.CREATION_DATE, "Last Thursday"); // Set a date property as a raw string 
   metadata.getDate(Metadata.CREATION_DATE); // Bang!

I'd lean towards throwing an IllegalArgumentException, but I guess we could return null, or create a new PropertyTypeViolationException (or something better named!) and throw that?



[jira] Resolved: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-451.
-----------------------------

    Fix Version/s: 0.8
       Resolution: Fixed

I think all the key metadata keys are now defined as Date Properties, and all the main parsers are updated, so I believe this one is now resolved.



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898794#action_12898794 ] 

Staffan Olsson commented on TIKA-451:
-------------------------------------

I converted DublinCore.DATE to Property.internalDate in http://github.com/solsson/tika/commit/2d637712053a758e7a6d5940c1a635615913056e
This affects the DcXML, Mbox, OOXML and image parsers.

This patch makes use of refactoring I did to get better access to the Metadata Extractor API, for example to getDate(tagType). I'll post these changes as a new ticket shortly.



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886123#action_12886123 ] 

Nick Burch commented on TIKA-451:
---------------------------------

I'm +1 to Jukka's idea



[jira] Commented: (TIKA-451) Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894877#action_12894877 ] 

Staffan Olsson commented on TIKA-451:
-------------------------------------

The JPEG parser (TiffExtractor.handleCommonImageTags and JpegParserTest) has the same issue.

The test asserts a date format that is not ISO 8601. The javadoc for the field (DublinCore.DATE) specifies ISO 8601, so the test is clearly wrong. There is a "TODO Make me a Date Property" on it. I have code for converting Metadata Extractor's dates to ISO 8601, so I could fix this, but which field should we use? This issue discusses MSOffice.CREATION_DATE, but I think DublinCore makes more sense for images. However, Tika will be easier to use if there is only one creation date field.
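
As a rough sketch of the kind of conversion involved (not the actual patch), the following turns an EXIF-style timestamp into ISO 8601. It assumes the input uses the usual EXIF layout "yyyy:MM:dd HH:mm:ss" and carries no timezone, so the result is an ISO 8601 local date-time without an offset. ExifDateSketch and exifToIso8601 are hypothetical names; nothing here calls Metadata Extractor or Tika.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Tika or Metadata Extractor.
final class ExifDateSketch {

    // Assumed input layout: colons separate the date fields as well as the time fields.
    private static final DateTimeFormatter EXIF_FORMAT =
            DateTimeFormatter.ofPattern("yyyy:MM:dd HH:mm:ss", Locale.ROOT);

    /**
     * Converts an EXIF-style timestamp to an ISO 8601 local date-time.
     * Throws DateTimeParseException if the input does not match the assumed layout.
     */
    static String exifToIso8601(String exifDate) {
        LocalDateTime parsed = LocalDateTime.parse(exifDate, EXIF_FORMAT);
        // No zone information is available, so none is invented here.
        return parsed.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(exifToIso8601("2010:06:30 13:41:49"));
    }
}
```

Leaving the offset off, rather than guessing UTC or the local zone, keeps the output honest about what the image metadata actually records.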
