You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "John Gibson (JIRA)" <ji...@apache.org> on 2014/07/15 18:05:04 UTC

[jira] [Created] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

John Gibson created TIKA-1369:
---------------------------------

             Summary: Date parsing and thread safety in ImageMetadataExtractor
                 Key: TIKA-1369
                 URL: https://issues.apache.org/jira/browse/TIKA-1369
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.5
         Environment: OS X 10.9.4 Java 7_60
            Reporter: John Gibson
            Priority: Critical


The {{ImageMetadataExtractor}} uses a static instance of {{SimpleDateFormat}}.  This is not thread safe.
{code:title=ImageMetadataExtractor.java}
    static class ExifHandler implements DirectoryHandler {
        private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        ...
        public void handleDateTags(Directory directory, Metadata metadata)
                throws MetadataException {
            // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
            Date original = null;
            if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
                original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
                // Unless we have GPS time we don't know the time zone so date must be set
                // as ISO 8601 datetime without timezone suffix (no Z or +/-)
                if (original != null) {
                    String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
                    metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                    metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
                }
            }
       ...
{code}

This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. In the discussion there the idea of using alternative thread-safe (and faster) formatters from either Joda time or Commons Lang were dismissed because they would add too many dependencies. Given that Tika already has a fairly large laundry list of dependencies to parse content, adding one more JAR to make sure things don't break is probably a good idea.

In addition, because no timezone or locale are specified by either Tika's formatter or the call to com.drew.metadata.Directory it can wreak havok during randomized testing. Given that the timezone is unknown, why not just default it to UTC and let the caller guess the timezone? As it stands I have to reparse all of the dates into UTC to get stable behavior across timezones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)