Posted to dev@tika.apache.org by "Staffan Olsson (JIRA)" <ji...@apache.org> on 2010/08/17 10:15:28 UTC

[jira] Created: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Refactor image and jpeg parsers for access to MetadataExtractor API
-------------------------------------------------------------------

                 Key: TIKA-482
                 URL: https://issues.apache.org/jira/browse/TIKA-482
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.7
            Reporter: Staffan Olsson


When I added support for more image metadata in TIKA-472, I realized
the current design had some restrictions:
 * I could not access the typed getters from Metadata Extractor, such
as getDate (to format an ISO date) and getStringArray (for keywords).
 * The handler function was called one field at a time, which prevents
logic where one field depends on the value of another (there are, for
example, record versions and fields that specify encoding).
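
A minimal sketch of the second restriction, for illustration only. The Directory and CaptionHandler types and the "Encoding" and "Caption" tag names are hypothetical stand-ins, not the Metadata Extractor or Tika API. The point is that a handler given the whole directory can read one field in order to decode another, which a one-field-at-a-time callback cannot do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parsed metadata directory: tag name -> raw bytes.
class Directory {
    private final Map<String, byte[]> tags = new LinkedHashMap<>();

    void put(String tag, byte[] raw) { tags.put(tag, raw); }
    boolean containsTag(String tag) { return tags.containsKey(tag); }
    byte[] getBytes(String tag) { return tags.get(tag); }
}

// Directory-level handler: it consults the (hypothetical) "Encoding" tag
// before decoding "Caption".
class CaptionHandler {
    static String caption(Directory dir) {
        if (!dir.containsTag("Caption")) {
            return null;
        }
        // Assumed default when no encoding field is present.
        Charset charset = StandardCharsets.ISO_8859_1;
        if (dir.containsTag("Encoding")) {
            String name = new String(dir.getBytes("Encoding"), StandardCharsets.US_ASCII);
            charset = Charset.forName(name);
        }
        return new String(dir.getBytes("Caption"), charset);
    }
}
```

With a per-field callback the caption bytes would arrive without any knowledge of the encoding field, so the handler would have to guess.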

See attached patch. It refactors TiffExtractor to MetadataExtractorExtractor.
The patch also includes the date fix, see https://issues.apache.org/jira/browse/TIKA-451#action_12898794

We can later add more Extractors using other libraries, and map to parsers based on format. For example we already use ImageIO in ImageParser so maybe there should be an ImageIOExtractor. To support more image formats we could investigate XMP, for example using http://www.pkg.dk/projects/XMP-Utilities-for-Java-XMPUtil4J/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906483#action_12906483 ] 

Nick Burch commented on TIKA-482:
---------------------------------

I couldn't include ImageMetadataExtractorTest as it uses new features of the extractor that weren't in the patch...

Looking at your latest git patch:
* I think we do need all the random metadata as-is, since that is all there has been for a while, and anyone currently using Tika will be relying on those fields
* Could ExifOldStyleHandler and ExifHandler be merged? I guess ExifOldStyleHandler would want to be switched from tag iterator to directory.containsTag though?
* For the keywords, would it not be better to use the Tika metadata multiple-value support, rather than underscore stuff?
* What else do you think is needed before we could apply this?

On the date thing, maybe the right thing to do is:
* EXIF original date -> Metadata.DATE, Metadata.CREATION_DATE
* EXIF date -> Metadata.LAST_MODIFIED
Would that make more sense to you?
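
The proposed mapping could be sketched roughly as follows. This is only an illustration: the class name is made up, and the key strings are placeholders standing in for Metadata.DATE, Metadata.CREATION_DATE and Metadata.LAST_MODIFIED rather than the actual values of those constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExifDateMapping {
    // Placeholder keys, assumed for this sketch only.
    static final String DATE = "date";
    static final String CREATION_DATE = "Creation-Date";
    static final String LAST_MODIFIED = "Last-Modified";

    // exifOriginalDate: EXIF "Date/Time Original"; exifDate: EXIF "Date/Time".
    // Either argument may be null when the image does not carry that tag.
    static Map<String, String> map(String exifOriginalDate, String exifDate) {
        Map<String, String> out = new LinkedHashMap<>();
        if (exifOriginalDate != null) {
            out.put(DATE, exifOriginalDate);
            out.put(CREATION_DATE, exifOriginalDate);
        }
        if (exifDate != null) {
            out.put(LAST_MODIFIED, exifDate);
        }
        return out;
    }
}
```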



[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922448#action_12922448 ] 

Staffan Olsson commented on TIKA-482:
-------------------------------------

I've had this running for quite a while now and it looks stable. If the patch is too complex I suggest (again) that we remove the non-namespaced EXIF fields. It will break backwards compatibility as you said Nick, but Tika is still 0.X and the names of EXIF fields differ between cameras anyway. I suggest copying all the fields with an "exif:" prefix instead, as there would be no risk of conflicts with defined fields and the raw output of all metadata fields would be clearer.
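
A rough sketch of what the prefixed copy would look like, using plain maps as hypothetical stand-ins for the raw EXIF tags and the Tika metadata object (the class and method names are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExifPrefixCopy {
    static final String PREFIX = "exif:";

    // Copies every raw tag under a namespaced key, so a camera-specific tag
    // name can never collide with a defined metadata field.
    static Map<String, String> copyWithPrefix(Map<String, String> rawExif) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : rawExif.entrySet()) {
            out.put(PREFIX + e.getKey(), e.getValue());
        }
        return out;
    }
}
```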



[jira] Updated: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Staffan Olsson updated TIKA-482:
--------------------------------

    Attachment: TIKA-482_exif_and_xmp.patch
                testJPEG_commented_pspcs2mac.jpg
                testJPEG_commented_xnviewmp026.jpg



[jira] Updated: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Staffan Olsson updated TIKA-482:
--------------------------------

    Attachment: TIKA-451-DublinCore_and_TIKA-482.patch

Was developed as several commits at http://github.com/solsson/tika



[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916167#action_12916167 ] 

Nick Burch commented on TIKA-482:
---------------------------------

Staffan - do you think this is now stable enough to apply? If so, could you please upload a patch + tick the license grant box, and I'll review + hopefully commit!



[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906370#action_12906370 ] 

Staffan Olsson commented on TIKA-482:
-------------------------------------

I see the difference in date now:

         assertEquals("Date/Time Original for when the photo was taken, unspecified time zone",
-                "2009-08-11T09:09:45", metadata.get(Metadata.ORIGINAL_DATE));
+                "2009-08-11T09:09:45", metadata.get(Metadata.DATE));
         assertEquals("This image has different Date/Time than Date/Time Original, so it is probably modification date",
-                "2009-10-02T23:02:49", metadata.get(Metadata.DATE));
+                "2009-10-02T23:02:49", metadata.get(Metadata.LAST_MODIFIED));

I don't agree with this change, because the javadocs of DublinCore.DATE say "Typically, Date will be associated with the creation or availability of the resource".
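
For reference, the assertions above expect ISO 8601 strings with no time zone. EXIF stores dates as "yyyy:MM:dd HH:mm:ss", so the conversion could be sketched with java.time as below. This is only an illustration with a made-up class name, not the code in the patch, which uses the typed getDate getter.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class ExifDates {
    // EXIF date/time layout; it carries no time zone information.
    private static final DateTimeFormatter EXIF =
            DateTimeFormatter.ofPattern("yyyy:MM:dd HH:mm:ss");

    // Converts an EXIF date string to an ISO 8601 local date-time string.
    // Throws DateTimeParseException if the input does not match the layout.
    static String toIsoLocal(String exifValue) {
        LocalDateTime t = LocalDateTime.parse(exifValue, EXIF);
        return t.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }
}
```

LocalDateTime is used rather than a zoned type because the time zone is unspecified, as the first assertion message notes.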



[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916481#action_12916481 ] 

Staffan Olsson commented on TIKA-482:
-------------------------------------

Attached a combined svn patch for the fixes above.
 * Added XMP parsing of Title, Subject and Description with encoding support
 * Date extraction to Dublin Core fields matches javadoc for the fields
 * All EXIF parsing done the same way
 * EXIF tags are mapped to fields of the same name unless the name is a defined Tika field (for backwards compatibility -- they should have been added with a namespace such as "exif:" from the start)
 * multi-value keywords
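
The last two points could be sketched like this. It is a simplified, hypothetical model and not the patch itself: the class is a stand-in for a multi-valued metadata object, and the DEFINED set is an assumed example of reserved field names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LegacyExifCopy {
    // Assumed example of names that are defined fields and must not be
    // overwritten by a raw EXIF tag of the same name.
    static final Set<String> DEFINED =
            new HashSet<>(Arrays.asList("date", "title", "subject"));

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Multi-value support: repeated adds accumulate instead of overwriting.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    List<String> getValues(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    // Backwards-compatible copy: keep the raw tag under its own name
    // unless that name is reserved.
    void copyRawTag(String tagName, String value) {
        if (!DEFINED.contains(tagName)) {
            add(tagName, value);
        }
    }
}
```

Keywords then become several values under one name (for example two add calls for "subject") rather than separate underscore-suffixed fields.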

Questions:
 * Reading the same input stream twice in JpegParser and TiffParser, is this how to do it in Tika?
 * Finding out if a string is a defined Tika field, is the new class necessary?
 * What if the parsing of an EXIF field throws an exception, should we attempt to extract the remaining fields anyway?




[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906375#action_12906375 ] 

Staffan Olsson commented on TIKA-482:
-------------------------------------

Tried to merge both efforts in
http://github.com/solsson/tika/commit/9aaff18305abf352b355b1ef8de753bdc2e6b5b5

Merging was difficult because my patch was in the same commit as some additional changes, so I had to manually extract those changes into my fork.



[jira] Updated: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Staffan Olsson updated TIKA-482:
--------------------------------

    Comment: was deleted

(was: Noticed that we already have jempbox in Tika so a JempboxExtractor in the image package would probably be the best approach to reading XMP. I'll make a separate ticket for this.)



[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907386#action_12907386 ] 

Staffan Olsson commented on TIKA-482:
-------------------------------------

Copying of all fields is back, almost... Solved the backwards compatibility issue in
http://github.com/solsson/tika/commit/770a453acff9a490c27c6a2fd01828c7ddd5fde1



[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906352#action_12906352 ] 

Staffan Olsson commented on TIKA-482:
-------------------------------------

I've looked through your commits briefly but haven't had time to merge them into my branch yet.

I couldn't spot the difference in Exif Date/Time, what is it?

A test seems to be missing, namely http://github.com/solsson/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/image/ImageMetadataExtractorTest.java

I'm working on adding real XMP extraction support for images, but waiting for https://issues.apache.org/jira/browse/PDFBOX-806. 



[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922535#action_12922535 ] 

Nick Burch commented on TIKA-482:
---------------------------------

Sorry for the delay Staffan. I've done a quick review and it looked fine, and I'm hoping to do a longer review + commit early next week (it's been a crazy busy fortnight...)



[jira] Issue Comment Edited: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906370#action_12906370 ] 

Staffan Olsson edited comment on TIKA-482 at 9/6/10 12:42 AM:
--------------------------------------------------------------

I see the difference in date now:

         assertEquals("Date/Time Original for when the photo was taken, unspecified time zone",
-                "2009-08-11T09:09:45", metadata.get(Metadata.DATE));
+                "2009-08-11T09:09:45", metadata.get(Metadata.ORIGINAL_DATE));
         assertEquals("This image has different Date/Time than Date/Time Original, so it is probably modification date",
-                "2009-10-02T23:02:49", metadata.get(Metadata.LAST_MODIFIED));
+                "2009-10-02T23:02:49", metadata.get(Metadata.DATE));

I don't agree with this change, because the javadocs of DublinCore.DATE say "Typically, Date will be associated with the creation or availability of the resource".

      was (Author: solsson):
    I see the difference in date now:

         assertEquals("Date/Time Original for when the photo was taken, unspecified time zone",
-                "2009-08-11T09:09:45", metadata.get(Metadata.ORIGINAL_DATE));
+                "2009-08-11T09:09:45", metadata.get(Metadata.DATE));
         assertEquals("This image has different Date/Time than Date/Time Original, so it is probably modification date",
-                "2009-10-02T23:02:49", metadata.get(Metadata.DATE));
+                "2009-10-02T23:02:49", metadata.get(Metadata.LAST_MODIFIED));

I don't agree with this change, because the javadocs of DublinCore.DATE say "Typically, Date will be associated with the creation or availability of the resource".
  


[jira] Updated: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Staffan Olsson updated TIKA-482:
--------------------------------

    Attachment: testTIFF.tif

New version of testTIFF needed for the added tests. Metadata diff:
9c9
< Exif.Image.StripOffsets                      Long        2  8 19208
---
> Exif.Image.StripOffsets                      Long        2  3084 22284
17a18,19
> Exif.Image.XMLPacket                         Byte      2500  (Binary value suppressed)
> Xmp.dc.subject                               XmpBag      2  cat, garden


> Refactor image and jpeg parsers for access to MetadataExtractor API
> -------------------------------------------------------------------
>
>                 Key: TIKA-482
>                 URL: https://issues.apache.org/jira/browse/TIKA-482
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Staffan Olsson
>         Attachments: testJPEG_commented_pspcs2mac.jpg, testJPEG_commented_xnviewmp026.jpg, testTIFF.tif, TIKA-451-DublinCore_and_TIKA-482.patch, TIKA-482_exif_and_xmp.patch


[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907097#action_12907097 ] 

Staffan Olsson commented on TIKA-482:
-------------------------------------

Merged EXIF parsing so that all fields are now processed the same way, in:
http://github.com/solsson/tika/commit/e61048c177560b8aa5585fd5b1d9194f446bec65
with some minor additions in:
http://github.com/solsson/tika/commit/c22107201178064ffd2260ca9136cb0d57c46d1f
http://github.com/solsson/tika/commit/880c0eb11f8410296d9b1401afef2dc37abbaf24

Set dates as Nick suggested:
http://github.com/solsson/tika/commit/866d396497dd7b95329f465d8cb220ad2899dc8b

Keywords are handled as multi-valued since:
http://github.com/solsson/tika/commit/9742c826a5edad6d0288b83d3653735dd85b116f
Note:
 * Assertions for the "subject" field and for Unicode characters in the description may need to be commented out until XMP support is merged.
 * This commit disables the copying of all fields, for the reasons stated in the commit comment.
Can it be done as in PDFParser, copying only the fields that are not explicitly mapped?
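
To make the question above concrete, here is a minimal sketch of the "copy only what is not explicitly mapped" idea combined with multi-valued keywords. It is not the code from the commits above and it does not use Tika's Metadata class: a plain Map of lists stands in for the metadata store, and the class name, method name and field names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika.
final class UnmappedFieldCopy {

    // raw: field name -> all values read from the image (keywords have several).
    // explicitMapping: raw field name -> target metadata key.
    static Map<String, List<String>> copy(Map<String, List<String>> raw,
                                          Map<String, String> explicitMapping) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : raw.entrySet()) {
            String target = explicitMapping.get(e.getKey());
            // A mapped field is stored only under its target key;
            // an unmapped field is copied under its raw name.
            String key = (target != null) ? target : e.getKey();
            // Append rather than overwrite, so multi-valued fields keep every value.
            out.computeIfAbsent(key, k -> new ArrayList<>()).addAll(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> raw = new LinkedHashMap<>();
        raw.put("Keywords", Arrays.asList("cat", "garden"));
        raw.put("Flash", Arrays.asList("No"));

        Map<String, String> mapping = new HashMap<>();
        mapping.put("Keywords", "subject");

        System.out.println(copy(raw, mapping));
    }
}
```

With this shape a mapped field never shows up twice (once under its raw name and once under the mapped key), which is one way to read "only the fields that are not explicitly mapped".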





[jira] Commented: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905930#action_12905930 ] 

Nick Burch commented on TIKA-482:
---------------------------------

Thanks for this patch.

I've applied it with a few tweaks in r992319.

The two main changes were:
* A different name for the Exif parser class: ImageMetadataExtractor seemed a better name than MetadataExtractorExtractor
* Original and default dates are handled slightly differently. This was done with TIKA-504 in mind, but we should maybe think about which set of date-related properties is the right one to map onto

I'll keep this issue open for now, as it looks from your Git repo as though you have some more cool new refactorings coming along shortly!



[jira] Updated: (TIKA-482) Refactor image and jpeg parsers for access to MetadataExtractor API

Posted by "Staffan Olsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Staffan Olsson updated TIKA-482:
--------------------------------

    Description: 
When I added support for more image metadata in TIKA-472, I realized
the current design had some restrictions:
 * I could not access the typed getters from Metadata Extractor, such
as getDate (to format an ISO date) and getStringArray (for keywords).
 * The handler function was called one field at a time, which prevents
logic where one field depends on the value of another (there are, for
example, record versions and fields that specify encoding)

See attached patch. It refactors TiffExtractor to MetadataExtractorExtractor.
The patch also includes the date fix, see https://issues.apache.org/jira/browse/TIKA-451#action_12898794

We can later add more Extractors using other libraries, and map to parsers based on format. For example we already use ImageIO in ImageParser so maybe there should be an ImageIOExtractor.

  was:
When I added support for more image metadata in TIKA-472, I realized
the current design had some restrictions:
 * I could not access the typed getters from Metadata Extractor, such
as getDate (to format an ISO date) and getStringArray (for keywords).
 * The handler function was called one field at a time, which prevents
logic where one field depends on the value of another (there are, for
example, record versions and fields that specify encoding)

See attached patch. It refactors TiffExtractor to MetadataExtractorExtractor.
The patch also includes the date fix, see https://issues.apache.org/jira/browse/TIKA-451#action_12898794

We can later add more Extractors using other libraries, and map to parsers based on format. For example we already use ImageIO in ImageParser so maybe there should be an ImageIOExtractor. To support more image formats we could investigate XMP, for example using http://www.pkg.dk/projects/XMP-Utilities-for-Java-XMPUtil4J/.
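
The second restriction in the description (one field depending on the value of another) may be easier to see in code. The sketch below is only an illustration under stated assumptions: it is not taken from the patch, the class and field names are hypothetical, and it simplifies the encoding field to a plain charset name, which is not how real image metadata stores it. The point is that a handler receiving the whole record can look up the encoding first and then decode the remaining fields, which a one-field-at-a-time callback cannot do reliably.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical handler that sees all fields of a record at once; not part of Tika.
final class WholeRecordHandler {

    // Assumed field name for illustration; holds a charset name such as "UTF-8".
    static final String CHARSET_FIELD = "Coded Character Set";

    static Map<String, String> handle(Map<String, byte[]> fields) {
        // Resolve the encoding before touching any text field, regardless of
        // the order in which the fields were read from the file.
        byte[] cs = fields.get(CHARSET_FIELD);
        Charset charset = (cs != null)
                ? Charset.forName(new String(cs, StandardCharsets.US_ASCII))
                : StandardCharsets.ISO_8859_1; // assumed fallback

        Map<String, String> decoded = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : fields.entrySet()) {
            if (CHARSET_FIELD.equals(e.getKey())) {
                continue; // control field, not content
            }
            decoded.put(e.getKey(), new String(e.getValue(), charset));
        }
        return decoded;
    }

    public static void main(String[] args) {
        Map<String, byte[]> fields = new LinkedHashMap<>();
        // The caption is deliberately listed before the field that says how to decode it.
        fields.put("Caption", "Tr\u00e4dg\u00e5rd".getBytes(StandardCharsets.UTF_8));
        fields.put(CHARSET_FIELD, "UTF-8".getBytes(StandardCharsets.US_ASCII));
        System.out.println(handle(fields).get("Caption"));
    }
}
```

Note that Charset.forName throws for unknown or illegal names, so a real extractor would need to handle that case.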


I noticed that we already have jempbox in Tika, so a JempboxExtractor in the image package would probably be the best approach to reading XMP. I'll make a separate ticket for this.
