Posted to dev@tika.apache.org by Staffan <so...@gmail.com> on 2010/08/16 09:03:00 UTC

Refactoring image and jpeg parsers

Hi,

When I added support for more image metadata in TIKA-472, I realized
that the current design has some restrictions:
 * I could not access the typed getters from Metadata Extractor, such
as getDate (to format an ISO date) and getStringArray (for keywords).
 * The handler function is called one field at a time, which prevents
logic where one field depends on the value of another (there are, for
example, record versions and fields that specify the encoding).
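
To make the two points concrete, here is a rough sketch of what having
the whole set of fields in hand makes possible. It is plain Java with
invented names: the map stands in for a directory of raw tag values,
the tag keys and the WholeDirectorySketch class are hypothetical, and
none of it is the Metadata Extractor or Tika API. The date pattern is
also just an assumption about what "ISO date" should look like.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class WholeDirectorySketch {

    // Hypothetical tag keys; a real library would use its own tag identifiers.
    static final String TAG_ENCODING = "encoding";
    static final String TAG_CAPTION = "caption";

    // Assumed output pattern for dates (ISO 8601 without zone designator).
    private static final DateTimeFormatter ISO =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss");

    /** Formats a typed Date value as an ISO-style string, interpreted in UTC. */
    static String formatIso(Date date) {
        return LocalDateTime.ofInstant(date.toInstant(), ZoneOffset.UTC).format(ISO);
    }

    /**
     * Decodes the caption using the charset named by another field of the
     * same directory. This only works when all fields are visible at once,
     * regardless of the order in which they were read.
     */
    static String decodeCaption(Map<String, byte[]> directory) {
        byte[] encoding = directory.get(TAG_ENCODING);
        Charset charset = (encoding == null)
                ? StandardCharsets.ISO_8859_1 // assumed fallback when no encoding field exists
                : Charset.forName(new String(encoding, StandardCharsets.US_ASCII));
        byte[] caption = directory.get(TAG_CAPTION);
        return (caption == null) ? null : new String(caption, charset);
    }

    public static void main(String[] args) {
        Map<String, byte[]> directory = new HashMap<>();
        // The caption is stored before the encoding field that describes it.
        directory.put(TAG_CAPTION, "Malm\u00f6".getBytes(StandardCharsets.UTF_8));
        directory.put(TAG_ENCODING, "UTF-8".getBytes(StandardCharsets.US_ASCII));

        System.out.println(decodeCaption(directory));
        System.out.println(formatIso(new Date(0L)));
    }
}
```

With a per-field callback the caption would be handed over before its
encoding is known, so it could not be decoded reliably.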

I also think it would be clearer if a Parser is per file format and an
Extractor is per library used.
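
Roughly, the split I have in mind could look like the sketch below.
The Extractor interface, FormatParser class and metadata keys are made
up for illustration and are not the actual Tika types; the two lambdas
only pretend to be library-backed extractors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserExtractorSketch {

    /** One extractor per underlying library; it adds what that library finds. */
    interface Extractor {
        void extract(byte[] image, Map<String, String> metadata);
    }

    /** One parser per file format; it only decides which extractors to run. */
    static final class FormatParser {
        private final List<Extractor> extractors = new ArrayList<>();

        FormatParser with(Extractor extractor) {
            extractors.add(extractor);
            return this;
        }

        Map<String, String> parse(byte[] image) {
            Map<String, String> metadata = new LinkedHashMap<>();
            for (Extractor extractor : extractors) {
                extractor.extract(image, metadata);
            }
            return metadata;
        }
    }

    public static void main(String[] args) {
        // Hypothetical extractors standing in for two different libraries.
        Extractor exifLike = (image, md) -> md.put("date", "2010-08-16T09:03:00");
        Extractor sizeLike = (image, md) -> md.put("bytes", String.valueOf(image.length));

        FormatParser tiffParser = new FormatParser().with(exifLike).with(sizeLike);
        System.out.println(tiffParser.parse(new byte[4]));
    }
}
```

A JPEG parser could then reuse the same extractors in a different
combination without duplicating any library-specific code.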

I refactored TiffExtractor to MetadataExtractorExtractor. We also use
ImageIO in the TIFF parser, so maybe there should be an extractor for
that too. I'm also looking for an XMP library in Java so we can have an
extractor for those fields from all kinds of images, including those
from Adobe programs.
This refactoring allowed me to get dates properly; see the comment in
https://issues.apache.org/jira/browse/TIKA-451.

Current version of the class:
http://github.com/solsson/tika/blob/b25218ed728b727bea71b0799c358f78d6df8c08/tika-parsers/src/main/java/org/apache/tika/parser/image/MetadataExtractorExtractor.java
The tests pass pretty much unchanged.

Should I create a patch and a ticket for this?

/Staffan