Posted to dev@tika.apache.org by Joerg Ehrlich <je...@adobe.com> on 2012/04/05 14:58:54 UTC

Metadata situation and XMP support in Tika

Hi everyone,

I am an engineer in the XMP/Metadata team at Adobe and we would like to leverage Tika in current projects for metadata extraction (and mimetype detection).
Our current systems primarily use the XMP data model to manage and interact with metadata.
As far as I can see, Tika's support for the XMP data model and for standard metadata schemas/namespaces (IPTC, Exif, etc.) is quite limited today.
But instead of wrapping Tika in our own layers of code, we feel it would be more useful to contribute to the project going forward.

I have taken a closer look at Tika and at how to improve its metadata/XMP output.
I saw that you already have an issue for XMP (TIKA-756), which I would probably use to submit any related patches.
But I am currently unsure what the best approach to the XMP mapping would be, and I would like to hear your opinions before starting any work.

Let me quickly summarize the basic metadata concept to check that I have understood it correctly:

1. Each parser fills a Metadata map, which is a simple key-value list where a key can also hold multiple values.

2. The keys for the Metadata map are mostly taken from fixed lists, which are defined as interfaces in the Metadata class.

3. Those keys are usually Property objects, and the Property class also serves as a static registry of every property created in the Metadata interfaces. The Property class resembles the XMP data model to some extent but does not store, for example, any hierarchical information. It also leaves each client the choice of whether to store property names with prefixes.

4. Any metadata outputter just iterates over the Metadata map and can query the Property list for additional information.

5. The XMP outputter (XMPContentHandler) outputs only those properties that are stored with a prefix in the Property list.
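To make the summary above concrete, here is a minimal sketch of a multi-valued key-value store and of an outputter that keeps only prefixed names. The classes `SimpleMetadata` and `MetadataMapSketch` are hypothetical stand-ins written for this illustration, not Tika's actual classes, and the prefix check on the stored name is a simplification of the lookup in the Property list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not Tika's real classes.
class MetadataMapSketch {
    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();
        metadata.add("dc:subject", "sunset");
        metadata.add("dc:subject", "beach");  // a second value for the same key
        metadata.add("Author", "Jane Doe");   // stored without a prefix

        System.out.println(metadata.getValues("dc:subject"));
        System.out.println(metadata.prefixedNames());
    }
}

/** A key-value store in which every key can hold several values. */
class SimpleMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value to the (possibly new) list stored under the given name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns all values for a name, or an empty list if the name is unknown. */
    List<String> getValues(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }

    /** Names an XMP-style outputter could serialize: only those of the form "prefix:local". */
    List<String> prefixedNames() {
        List<String> result = new ArrayList<>();
        for (String name : values.keySet()) {
            if (name.indexOf(':') > 0) {
                result.add(name);
            }
        }
        return result;
    }
}
```

In this sketch the unprefixed "Author" entry is silently dropped from the XMP-style output, which is the behavior described in point 5.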


I see two potential ways to improve the situation:


1. Have a fixed mapping table for each MIME type, which XMPContentHandler would use to map from the Metadata map to the XMP data model. Such mapping tables would be pretty ugly, as each parser produces a different metadata map and there is no consistent way to handle them. This option would be the least invasive for other Tika clients, but it would be a real hack and would not improve the metadata situation in Tika in general.

2. Improve the key interface lists of the Metadata class and adjust all parsers accordingly. This could be done by adding new keys with prefixes and keeping/deprecating the existing ones so that existing clients are not disturbed. This is similar to what is proposed for the DublinCore namespace in TIKA-859 and TIKA-842.
This would be more invasive but would offer the opportunity to really improve the metadata situation. I have already seen a couple of places in the code that clearly break existing standards. There are also cases where a value might have to be mapped to several properties at the same time: GPS data from Exif is currently mapped to the W3C vocabulary in Tika, but in XMP this mapping is defined differently [1] by CIPA (the Exif standardization committee). So both mappings probably have to be supported.
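A rough sketch of what option two could look like for the GPS case: the old unprefixed key stays available but is deprecated, and the parser writes the value under both a W3C-style key and an Exif-style key. All key names, the `Keys` and `GpsMapper` classes, and the sample values are assumptions made for this illustration; they are not taken from the Tika code base, and the exact value formats would have to follow the respective specifications.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DualMappingSketch {
    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        // Both representations are supplied by a (hypothetical) parser.
        GpsMapper.putLatitude(metadata, "53.55", "53,33.0N");
        System.out.println(metadata);
    }
}

/** Illustrative key names only; not copied from Tika's interfaces. */
final class Keys {
    /** Old, unprefixed key, kept so that existing clients continue to work. */
    @Deprecated
    static final String LATITUDE = "Latitude";
    /** Prefixed key in the style of the W3C geo vocabulary. */
    static final String GEO_LAT = "geo:lat";
    /** Prefixed key in the style of the Exif namespace in XMP. */
    static final String EXIF_GPS_LATITUDE = "exif:GPSLatitude";

    private Keys() {}
}

final class GpsMapper {
    private GpsMapper() {}

    /** Stores one latitude under the legacy key and under both prefixed keys. */
    @SuppressWarnings("deprecation")
    static void putLatitude(Map<String, String> metadata,
                            String decimalDegrees, String exifValue) {
        metadata.put(Keys.LATITUDE, decimalDegrees);      // legacy clients
        metadata.put(Keys.GEO_LAT, decimalDegrees);       // W3C-style semantics
        metadata.put(Keys.EXIF_GPS_LATITUDE, exifValue);  // Exif/XMP-style semantics
    }
}
```

A client that only knows the old key keeps working, while an XMP outputter could pick whichever prefixed key matches the semantics it needs.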

I personally would prefer option two. What do you think?
Looking forward to working with you guys.
Regards
Jörg

[1] http://www.cipa.jp/english/hyoujunka/kikaku/cipa_e_kikaku_list.html

RE: Metadata situation and XMP support in Tika

Posted by Joerg Ehrlich <je...@adobe.com>.

Oops, forgot the links...

-----Original Message-----

Hi Nick,

Yes, I agree that Tika should support unified access to common metadata properties like title, description, keywords, creator, rating, etc. So there should be clear semantics for those common properties, regardless of the underlying implementation in the various metadata containers. And access to these properties can, or should, be as simple as "Metadata.title".
On the other hand, if you think about Tika being used in business workflows where clients really care about the underlying semantics and file-format-specific metadata, you might need something more powerful and flexible to access and manage metadata.
And I also agree that the latter should be possible without sacrificing the former.

On a side note:
While the idea of "Someone who understands the format works out how to map the file format's metadata onto a common set" is very compelling and sounds easy to do, in reality this can get very complicated. And if people have big businesses depending on such mappings, they tend to have different opinions about what the right way is. That's why we have organizations like the "Metadata Working Group" [1] or the W3C "Media Annotation Working Group" [2] trying to clean up the mess that has evolved in this area over the last decades.
And the moment you start writing metadata back into files, you will also run into all sorts of complications if you have simplified too much in the read case. But that is not a problem for Tika right now.

I agree with Ray that the current implementation can support both approaches to making metadata accessible.
While the metadata map can be used to offer easy access to the common set of properties, an XMP output could be used to offer more extensive, flexible and semantically clearer access to a file's metadata.
I agree with Ray that the common set of keys in the Metadata map should inherit/alias from well-known, standard namespaces like Dublin Core. That's why I said the Tika parsers should read metadata using the standard namespaces and properties. This would also make the mapping in the parsers clearer for developers who want to change something. Currently you always have to guess where something is mapped to.
In general, I'd recommend Dublin Core and the semantics of the ISO part of XMP, which builds on top of DC, for the common, file-format-neutral Tika properties that are offered to clients.
And I agree with Ray that having all metadata interfaces be part of the Metadata class is more confusing than helpful for clients.

I am about to put an architectural metadata roadmap on the Tika Wiki for further discussion.
There I want to illustrate a couple of ideas I have also been discussing with Jukka so far and the steps we see on a roadmap that should help us to improve the metadata situation for Tika.

Regards
Jörg 

[1] http://metadataworkinggroup.com/specs/
[2] http://www.w3.org/TR/2012/REC-mediaont-10-20120209/

-----Original Message-----
From: Ray Gauss II [mailto:ray.gauss@alfresco.com]
Sent: Tuesday, 24 April 2012 15:10
To: dev@tika.apache.org
Subject: Re: Metadata situation and XMP support in Tika

I think the aliasing approach supports both use cases nicely, i.e.:

Metadata.java:
...
   Property TITLE = DublinCore.DC_TITLE;
...

Users then only have to concern themselves with "give me the metadata that best fits the idea of Title, as defined by Tika", and not even have to know about DublinCore, but can dig into details of the implementation as needed.

This separation is less of a concern in the particular case of DublinCore, since it is such a basic, broad, and widely accepted standard, but for other standards direct inclusion in the Metadata interface makes less sense.  For example, at the moment we're essentially asking users to say "give me the metadata that best fits the idea of Keywords, as defined by MSOffice", which doesn't make a lot of sense when dealing with something like images.  If we aliased:

Metadata.java:
...
   Property KEYWORDS = MSOffice.MS_KEYWORDS;
...

we're back to the intended "give me the metadata that best fits the idea of Keywords, as defined by Tika".  In this case, DublinCore.DC_SUBJECT is probably a much better standard to alias keywords from than MSOffice, but I'm just sticking to the current mappings for this example.

Ray

On Apr 24, 2012, at 7:43 AM, Nick Burch wrote:

> On Fri, 13 Apr 2012, Joerg Ehrlich wrote:
>> I think it would be clearer if parsers/clients used the namespace or standard properties explicitly instead of the metadata one. But your idea of having a set of "standard" properties available in the Metadata class would be a good help for clients who don't care which "title" or "author" they read. They could just say "Metadata.title" instead of "DublinCore.title".
> 
> One thing to bear in mind is that we've tried to hide the differences between formats' metadata from end users of Tika. You shouldn't need to know if a format calls it "description" or "subject" or "title" or "dc:title" or "WhatItsAllAbout". Someone who understands the format works out how to map the file format's metadata onto a common set. End users can then say "give me the metadata that best fits the idea of Title, as defined by Dublin Core" and they get something back. The intricacies of the file formats are hidden from them; they get clean and consistent metadata back.
> 
> I certainly see there are cases when someone may want the full set of
> metadata back from a file, in quite a low-level way, but we should
> make sure we don't lose the ability of users to say "give me the
> title of that document, no matter what the format stores it as" that
> we currently have.
> 
> Nick

Re: Metadata situation and XMP support in Tika

Posted by Ray Gauss II <ra...@alfresco.com>.

I think the aliasing approach supports both use cases nicely, i.e.:

Metadata.java:
...
   Property TITLE = DublinCore.DC_TITLE;
...

Users then only have to concern themselves with "give me the metadata that best fits the idea of Title, as defined by Tika", and not even have to know about DublinCore, but can dig into details of the implementation as needed.

This separation is less of a concern in the particular case of DublinCore, since it is such a basic, broad, and widely accepted standard, but for other standards direct inclusion in the Metadata interface makes less sense.  For example, at the moment we're essentially asking users to say "give me the metadata that best fits the idea of Keywords, as defined by MSOffice", which doesn't make a lot of sense when dealing with something like images.  If we aliased:

Metadata.java:
...
   Property KEYWORDS = MSOffice.MS_KEYWORDS;
...

we're back to the intended "give me the metadata that best fits the idea of Keywords, as defined by Tika".  In this case, DublinCore.DC_SUBJECT is probably a much better standard to alias keywords from than MSOffice, but I'm just sticking to the current mappings for this example.
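For anyone who wants to see the aliasing idea in compilable form, here is a minimal sketch. `Prop`, `DublinCoreNs` and `CommonMetadata` are hypothetical names used only for this illustration; they stand in for Tika's Property class and metadata interfaces and do not reproduce their real definitions. The keywords alias points at the Dublin Core subject, as suggested above, rather than at the current MSOffice mapping.

```java
class AliasingSketch {
    public static void main(String[] args) {
        // The alias and the namespace constant are the very same object.
        System.out.println(CommonMetadata.TITLE == DublinCoreNs.DC_TITLE);
        System.out.println(CommonMetadata.TITLE.name());
    }
}

/** Hypothetical, much simplified stand-in for a property key. */
final class Prop {
    private final String name;

    Prop(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }
}

/** Namespace interface: defines the properties with their prefixed names. */
interface DublinCoreNs {
    Prop DC_TITLE = new Prop("dc:title");
    Prop DC_SUBJECT = new Prop("dc:subject");
}

/** Common set: defines nothing itself, it only aliases namespace properties. */
interface CommonMetadata {
    Prop TITLE = DublinCoreNs.DC_TITLE;
    Prop KEYWORDS = DublinCoreNs.DC_SUBJECT;
}
```

Because the alias is just another reference to the same object, a client asking for `CommonMetadata.TITLE` and a client asking for `DublinCoreNs.DC_TITLE` end up with the same key.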

Ray

On Apr 24, 2012, at 7:43 AM, Nick Burch wrote:

> On Fri, 13 Apr 2012, Joerg Ehrlich wrote:
>> I think it would be clearer if parsers/clients used the namespace or standard properties explicitly instead of the metadata one. But your idea of having a set of "standard" properties available in the Metadata class would be a good help for clients who don't care which "title" or "author" they read. They could just say "Metadata.title" instead of "DublinCore.title".
> 
> One thing to bear in mind is that we've tried to hide the differences between formats' metadata from end users of Tika. You shouldn't need to know if a format calls it "description" or "subject" or "title" or "dc:title" or "WhatItsAllAbout". Someone who understands the format works out how to map the file format's metadata onto a common set. End users can then say "give me the metadata that best fits the idea of Title, as defined by Dublin Core" and they get something back. The intricacies of the file formats are hidden from them; they get clean and consistent metadata back.
> 
> I certainly see there are cases when someone may want the full set of metadata back from a file, in quite a low-level way, but we should make sure we don't lose the ability of users to say "give me the title of that document, no matter what the format stores it as" that we currently have.
> 
> Nick

Re: Metadata situation and XMP support in Tika

Posted by Ingo Renner <in...@typo3.org>.

On 24.04.2012 at 13:43, Nick Burch wrote:

> I certainly see there are cases when someone may want the full set of metadata back from a file, in quite a low-level way, but we should make sure we don't lose the ability of users to say "give me the title of that document, no matter what the format stores it as" that we currently have.

+1

-- 
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code

TYPO3
Open Source Enterprise Content Management System
http://typo3.org

RE: Metadata situation and XMP support in Tika

Posted by Nick Burch <ni...@alfresco.com>.

On Fri, 13 Apr 2012, Joerg Ehrlich wrote:
> I think it would be clearer if parsers/clients used the
> namespace or standard properties explicitly instead of the metadata
> one. But your idea of having a set of "standard" properties available in
> the Metadata class would be a good help for clients who don't care which
> "title" or "author" they read. They could just say "Metadata.title"
> instead of "DublinCore.title".

One thing to bear in mind is that we've tried to hide the differences between
formats' metadata from end users of Tika. You shouldn't need to know if a
format calls it "description" or "subject" or "title" or "dc:title" or
"WhatItsAllAbout". Someone who understands the format works out how to map
the file format's metadata onto a common set. End users can then say "give
me the metadata that best fits the idea of Title, as defined by Dublin
Core" and they get something back. The intricacies of the file formats
are hidden from them; they get clean and consistent metadata back.

I certainly see there are cases when someone may want the full set of
metadata back from a file, in quite a low-level way, but we should make
sure we don't lose the ability of users to say "give me the title of that
document, no matter what the format stores it as" that we currently have.

Nick

Re: Metadata situation and XMP support in Tika

Posted by Ray Gauss II <ra...@alfresco.com>.

Yeah, I think that was the original motivation behind the Metadata class, a simple set of common properties for devs.

I prefer the stricter aliasing approach to the list, as I think it will be easier for devs who aren't intimately familiar with the standard they're working with.  If a dev is working on an app that needs to extract IPTC, they'll probably be looking for IPTC.Keywords, since that's the name in the IPTC standard and what it's called in Photoshop, etc., and may not know that they should actually be looking for IPTC.DC_SUBJECT.


On Apr 13, 2012, at 9:12 AM, Joerg Ehrlich wrote:

> Hi Ray,
> 
> Yes, that is pretty much what I would propose. Aliasing is one idea, or you could simply have a list, like the ones at the end of the IPTC class, which simply references the namespace properties. I don't have a strong opinion here.
> And I'm with you that I don't really see the benefit of including all interfaces in the Metadata class. I think it would be clearer if parsers/clients used the namespace or standard properties explicitly instead of the metadata one. But your idea of having a set of "standard" properties available in the Metadata class would be a good help for clients who don't care which "title" or "author" they read. They could just say "Metadata.title" instead of "DublinCore.title".
> 
> Regards
> Jörg
> 
> 
> From: Ray Gauss II [mailto:ray.gauss@alfresco.com] 
> 
> For the IPTC example specifically, all properties are defined using their respective namespaces, but some are defined 'inline' while others are an alias to the referenced standard, i.e.
> 
>   Property KEYWORDS = DublinCore.DC_SUBJECT; 
> 
> If I'm understanding you correctly, your proposal is to do that same aliasing for all the IPTC properties by creating new IptcCore, IptcExt, Photoshop, Plus, and XmpRights metadata interfaces that contain the full set of properties under those standards, and simply referencing them from IPTC?
> 
> If so, I'm on board, and that's the direction I've wanted to take things.  I'd go so far as to say the Tika Metadata interface itself should cherry-pick properties from other standards using that same aliasing approach, rather than attempting to include each entire standard via implements, which can obviously lead to name conflicts unless the properties are prefixed.
> 
> 
> On Apr 13, 2012, at 8:32 AM, Joerg Ehrlich wrote:
> 
>> Hi,
>> 
>> Looking at the current constants defined for the Metadata map, the interfaces do not follow a common pattern.
>> Some interfaces are organized around specific namespaces like DublinCore or XMPDM, some around specific standards like IPTC or CreativeCommons, and some around a specific functional purpose like Geographic or MSOffice. There are also namespace interfaces that mix properties from different namespaces, e.g. TIFF.
>> Overall, a clear separation of responsibility and semantics is not always ensured.
>> 
>> I would propose to reorganize and rename the interfaces into two groups: first, namespaces, and second, standards, which simply contain lists of the properties they use from the namespace interfaces.
>> The reason is that only those two concepts have unambiguous and clearly defined semantics, so each client knows what to do with them.
>> Properties that are currently not connected to a namespace (like the properties from the MSOffice interface) would also be moved to a namespace.
>> 
>> Old property definitions should of course be kept intact for existing clients, but that is independent of the internal interface organization.
>> 
>> For example, the IPTC standard uses properties from six different namespaces (dc, photoshop, plus, iptc_core, iptc_ext, xmp_rights), but not all of the properties that are defined in those namespaces.
>> In this case, I think it would make sense to have six interfaces for the namespaces, each containing all properties from its namespace, and one interface for IPTC itself that contains just lists of the properties it uses from the namespaces. The IPTC interface already has those lists; I would just remove all the property definitions from it.
>> A mapping of EXIF properties to XMP, for example, which is defined to use five namespaces (exif, exifEX, tiff, xmp and dc), can then also reuse the properties defined in the namespace interfaces.
>> I would rename the functional interface "Geographic" to, for example, "W3C_Geographic" or the like, so that it clearly names the semantics, which are bound to the W3C vocabulary and differ from the mapping to the EXIF namespace defined by CIPA.
>> In the case of the MSOffice metadata, this could be mapped to properties defined in either the Open Document standard [1] or the Microsoft OOXML one [2].
>> 
>> A parser can then map properties to whichever namespaces it sees fit, or to several at the same time, and the client can then decide which semantics (i.e. properties) it would like to use.
>> 
>> Regards
>> Jörg
>> 
>> [1] http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
>> [2] http://www.ecma-international.org/publications/standards/Ecma-376.htm (in part 2)
>> 
>> ---
>> Jörg Ehrlich | Computer Scientist | XMP Technology | Adobe Systems | 
>> joerg.ehrlich@adobe.com | work: +49(40)306360
>> 
>> -----Original Message-----
>> From: Mattmann, Chris A (388J) [mailto:chris.a.mattmann@jpl.nasa.gov]
>> Sent: Thursday, 5 April 2012 16:21
>> To: dev@tika.apache.org
>> Subject: Re: Metadata situation and XMP support in Tika
>> 
>> Hi Jörg,
>> 
>> Great summary! I would be in favor of option #2 as well, with the caveat that if we take it slow, I think there might be a way to not really have as much of a client/API impact, using deprecations and other techniques as you suggested.
>> 
>> Looking forward to your participation!
>> 
>> Cheers,
>> Chris
>> 
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattmann@nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department University of 
>> Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>

RE: Metadata situation and XMP support in Tika

Posted by Joerg Ehrlich <je...@adobe.com>.

Hi Ray,

Yes, that is pretty much what I would propose. Aliasing is one idea, or you could simply have a list, like the ones at the end of the IPTC class, which simply references the namespace properties. I don't have a strong opinion here.
And I'm with you that I don't really see the benefit of including all interfaces in the Metadata class. I think it would be clearer if parsers/clients used the namespace or standard properties explicitly instead of the metadata one. But your idea of having a set of "standard" properties available in the Metadata class would be a good help for clients who don't care which "title" or "author" they read. They could just say "Metadata.title" instead of "DublinCore.title".

Regards
Jörg


From: Ray Gauss II [mailto:ray.gauss@alfresco.com] 

For the IPTC example specifically, all properties are defined using their respective namespaces, but some are defined 'inline' while others are an alias to the referenced standard, i.e.

   Property KEYWORDS = DublinCore.DC_SUBJECT; 

If I'm understanding you correctly, your proposal is to do that same aliasing for all the IPTC properties by creating new IptcCore, IptcExt, Photoshop, Plus, and XmpRights metadata interfaces that contain the full set of properties under those standards, and simply referencing them from IPTC?

If so, I'm on board, and that's the direction I've wanted to take things.  I'd go so far as to say the Tika Metadata interface itself should cherry-pick properties from other standards using that same aliasing approach, rather than attempting to include each entire standard via implements, which can obviously lead to name conflicts unless the properties are prefixed.
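The name-conflict point can be illustrated with a small sketch. The interfaces `DcNs`, `OfficeNs` and `TikaCommon` and the "office:" names are made up for this example and are not Tika's real interfaces. In Java, an interface may inherit two constants with the same name from different parents, but any reference to that name is then rejected by the compiler as ambiguous; cherry-picking through aliases avoids the problem.

```java
class CherryPickSketch {
    public static void main(String[] args) {
        System.out.println(TikaCommon.TITLE);
        System.out.println(TikaCommon.KEYWORDS);
    }
}

/** Hypothetical namespace interface with unprefixed constant names. */
interface DcNs {
    String TITLE = "dc:title";
    String SUBJECT = "dc:subject";
}

/** A second hypothetical namespace that also declares a constant called TITLE. */
interface OfficeNs {
    String TITLE = "office:title";
    String KEYWORDS = "office:keywords";
}

// Including both standards wholesale would compile as a declaration:
//     interface Everything extends DcNs, OfficeNs {}
// but a reference such as Everything.TITLE would then fail to compile,
// because the compiler cannot tell which inherited TITLE is meant.

/** Cherry-picking: each common name is an explicit alias for one property. */
interface TikaCommon {
    String TITLE = DcNs.TITLE;
    String KEYWORDS = DcNs.SUBJECT;
}
```

With the aliases in place, the choice of which standard backs each common name is made once, in one visible spot, instead of being decided by inheritance order or constant-name prefixes.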



Re: Metadata situation and XMP support in Tika

Posted by Ray Gauss II <ra...@alfresco.com>.

For the IPTC example specifically, all properties are defined using their respective namespaces, but some are defined 'inline' while others are an alias to the referenced standard, i.e.

   Property KEYWORDS = DublinCore.DC_SUBJECT; 

If I'm understanding you correctly your proposal is to do that same aliasing for all the IPTC properties by creating new IptcCore, IptcExt, Photoshop, Plus, and XmpRights metadata interfaces that contain the full set of properties under those standards and simply referencing them from IPTC?

If so, I'm on board, and that's the direction I've wanted to take things.  I'd go so far as to say the Tika Metadata interface itself should cherry-pick properties from other standards using that same aliasing approach, rather than attempting to include each entire standard via implements, which can obviously lead to name conflicts unless the properties are prefixed.
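
To make the name-conflict point concrete, here is a minimal sketch with made-up names. Key, DublinCoreNs, OtherNs and Picked are hypothetical and are not taken from the Tika code base; OtherNs in particular is an invented namespace that happens to define its own TITLE.

```java
public class CherryPickSketch {

    /** Hypothetical stand-in for a metadata key; not Tika's real Property class. */
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }
    }

    interface DublinCoreNs {
        Key TITLE = new Key("dc:title");
        Key SUBJECT = new Key("dc:subject");
    }

    /** Made-up second namespace that also happens to define a TITLE. */
    interface OtherNs {
        Key TITLE = new Key("other:title");
    }

    // Pulling in both namespaces wholesale makes the bare name ambiguous:
    //   interface Everything extends DublinCoreNs, OtherNs {}
    //   Key k = Everything.TITLE;   // rejected by javac: reference to TITLE is ambiguous

    /** Cherry-picked aliases: each name is chosen explicitly, so nothing collides. */
    interface Picked {
        Key TITLE = DublinCoreNs.TITLE;
        Key KEYWORDS = DublinCoreNs.SUBJECT;
        Key OTHER_TITLE = OtherNs.TITLE;
    }

    public static void main(String[] args) {
        System.out.println(Picked.TITLE.name);
        System.out.println(Picked.KEYWORDS.name);
        System.out.println(Picked.OTHER_TITLE.name);
    }
}
```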


On Apr 13, 2012, at 8:32 AM, Joerg Ehrlich wrote:

> Hi,
> 
> Looking at the current constants defined for the Metadata map, the interfaces do not follow a common pattern.
> They are organized in interfaces for specific namespaces like DublinCore or XMPDM, there are interfaces for specific standards like IPTC or CreativeCommons and there are interfaces for a specific functional purpose like Geographic or MSOffice. There are also namespace interfaces that mix properties from different namespaces, e.g. TIFF. 
> Overall a clear separation of responsibility and semantic is not always ensured. 
> 
> I would propose to reorganize and rename the interfaces in two groups: First in namespaces and second in standards which simply contain lists with the properties they use from the namespace interfaces.
> The reason is that only those two concepts have unambiguous and clearly defined semantics where each client knows what to do with it.
> Properties which are currently not connected to a namespace (like the properties from MSOffice interface) would also be moved to a namespace.
> 
> Old property definitions should be kept intact, of course, for existing clients, but that is independent of the internal interface organization.
> 
> For example the IPTC standard uses properties from six different namespaces (dc, photoshop, plus, iptc_core, iptc_ext, xmp_rights), but not all of the properties that are defined in those namespaces. 
> I think it would make sense to have in this case six interfaces for the namespaces which contain all properties from those namespaces. And one interface for IPTC itself which contains just lists of the properties they use from the namespaces. The IPTC interface already has those lists, I would just remove all the property definitions from it.
> A mapping of EXIF properties to XMP for example, which is defined to use five namespaces (exif, exifEX, tiff, xmp and dc), can then also reuse the properties defined in the namespace interfaces.
> The functional interface "Geographic" I would rename, for example, to "W3C_Geographic" or the like, as that clearly indicates semantics bound to the W3C vocabulary, which differ from what is meant by the mapping to the EXIF namespace defined by CIPA.
> In case of the MSOffice metadata this could either be mapped to properties defined in Open Document standard [1] or the Microsoft OOXML one [2].
> 
> A parser can then map properties to those namespaces it sees fit or several at the same time and the client can then decide which semantic (i.e. properties) it would like to use.
> 
> Regards
> Jörg
> 
> [1] http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
> [2] http://www.ecma-international.org/publications/standards/Ecma-376.htm  (in part 2)
> 
> ---
> Jörg Ehrlich | Computer Scientist | XMP Technology | Adobe Systems | joerg.ehrlich@adobe.com | work: +49(40)306360
> 

RE: Metadata situation and XMP support in Tika

Posted by Joerg Ehrlich <je...@adobe.com>.

Hi,

Looking at the current constants defined for the Metadata map, the interfaces do not follow a common pattern.
Some are organized by namespace, like DublinCore or XMPDM; some by standard, like IPTC or CreativeCommons; and some by functional purpose, like Geographic or MSOffice. There are also namespace interfaces that mix properties from different namespaces, e.g. TIFF.
Overall, a clear separation of responsibilities and semantics is not always ensured.

I would propose to reorganize and rename the interfaces into two groups: first, namespaces, and second, standards, which simply contain lists of the properties they use from the namespace interfaces.
The reason is that only those two concepts have unambiguous and clearly defined semantics, so each client knows what to do with them.
Properties which are currently not connected to a namespace (like the properties from the MSOffice interface) would also be moved to a namespace.

Old property definitions should be kept intact, of course, for existing clients, but that is independent of the internal interface organization.

For example, the IPTC standard uses properties from six different namespaces (dc, photoshop, plus, iptc_core, iptc_ext, xmp_rights), but not all of the properties that are defined in those namespaces.
In this case I think it would make sense to have six interfaces for the namespaces, each containing all properties of its namespace, and one interface for IPTC itself, which contains just lists of the properties it uses from the namespaces. The IPTC interface already has those lists; I would just remove all the property definitions from it.
A mapping of EXIF properties to XMP, for example, which is defined to use five namespaces (exif, exifEX, tiff, xmp and dc), can then also reuse the properties defined in the namespace interfaces.
The functional interface "Geographic" I would rename, for example, to "W3C_Geographic" or the like, as that clearly indicates semantics bound to the W3C vocabulary, which differ from what is meant by the mapping to the EXIF namespace defined by CIPA.
The MSOffice metadata could be mapped either to properties defined in the Open Document standard [1] or to the Microsoft OOXML one [2].

A parser can then map properties to whichever namespaces it sees fit, or to several at the same time, and the client can then decide which semantics (i.e. properties) it would like to use.
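
A minimal sketch of this two-group layout could look as follows. All type names here (Key, the *Ns holders, IptcStandard, mapLatitude) are hypothetical and only illustrate the idea; the property names and the value formats are simplified assumptions, not a statement of what Tika or the standards require.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NamespaceVsStandardSketch {

    /** Hypothetical stand-in for a metadata key; not Tika's real Property class. */
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }
    }

    // Group 1: namespace holders, each owning the definitions of one namespace.
    interface DublinCoreNs { Key SUBJECT = new Key("dc:subject"); }
    interface PhotoshopNs  { Key CITY = new Key("photoshop:City"); }
    interface W3cGeoNs     { Key LAT = new Key("geo:lat"); }
    interface ExifNs       { Key GPS_LATITUDE = new Key("exif:GPSLatitude"); }

    // Group 2: a standard defines nothing itself, it only lists what it uses.
    interface IptcStandard {
        List<Key> PROPERTIES = Collections.unmodifiableList(
                Arrays.asList(DublinCoreNs.SUBJECT, PhotoshopNs.CITY));
    }

    /**
     * A parser may write the same fact under several namespaces at once.
     * Both values are passed in pre-formatted to keep the sketch short.
     */
    static Map<String, String> mapLatitude(String decimalDegrees, String exifStyle) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(W3cGeoNs.LAT.name, decimalDegrees);
        metadata.put(ExifNs.GPS_LATITUDE.name, exifStyle);
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = mapLatitude("53.55", "53,33.0N");
        // The client picks the semantics it wants by choosing the key.
        System.out.println(metadata.get(W3cGeoNs.LAT.name));
        System.out.println(metadata.get(ExifNs.GPS_LATITUDE.name));
        System.out.println(IptcStandard.PROPERTIES.size());
    }
}
```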

Regards
Jörg

[1] http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
[2] http://www.ecma-international.org/publications/standards/Ecma-376.htm  (in part 2)

---
Jörg Ehrlich | Computer Scientist | XMP Technology | Adobe Systems | joerg.ehrlich@adobe.com | work: +49(40)306360

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Thursday, April 5, 2012 16:21
To: dev@tika.apache.org
Subject: Re: Metadata situation and XMP support in Tika

Hi Jörg,

Great summary! I would be in favor of option #2 as well, with the caveat that if we take it slow, I think there might be a way to not really have as much of a client/API impact, using deprecations and other techniques as you suggested.

Looking forward to your participation!

Cheers,
Chris


Re: Metadata situation and XMP support in Tika

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Jörg,

Great summary! I would be in favor of option #2 as well, with the caveat that if we take it slow, I think there might be a way to not really have as much of a client/API impact, using deprecations and other techniques as you suggested.
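
One way the deprecation route might look, as a sketch only: the old unprefixed key stays available but is marked deprecated, and a value is written under both the old and the new key during the transition. LegacyKeys, DublinCoreNs and setTitle are hypothetical names for this example and do not describe Tika's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class DeprecationSketch {

    /** New, prefixed key (hypothetical holder). */
    interface DublinCoreNs {
        String TITLE = "dc:title";
    }

    /** Old, unprefixed key kept for existing clients (hypothetical holder). */
    interface LegacyKeys {
        /** @deprecated use {@link DublinCoreNs#TITLE} instead. */
        @Deprecated
        String TITLE = "title";
    }

    /** During the transition a parser writes the value under both keys. */
    static void setTitle(Map<String, String> metadata, String value) {
        metadata.put(DublinCoreNs.TITLE, value);
        metadata.put(LegacyKeys.TITLE, value); // still readable by old clients
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        setTitle(metadata, "Annual report");
        System.out.println(metadata.get(DublinCoreNs.TITLE));
        System.out.println(metadata.get(LegacyKeys.TITLE));
    }
}
```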

Looking forward to your participation!

Cheers,
Chris

On Apr 5, 2012, at 5:58 AM, Joerg Ehrlich wrote:

> Hi everyone,
> 
> I am an engineer in the XMP/Metadata team at Adobe and we would like to leverage Tika in current projects for metadata extraction (and mimetype detection).
> Our current systems primarily use the XMP data model to manage and interact with metadata.
> As far as I can see, the support for the XMP data model and also for standard metadata schemas/namespaces (like IPTC, Exif, etc.) in Tika is pretty suboptimal as of today.
> But instead of wrapping Tika in own layers of code in our systems, we feel that it would be more useful to contribute to the project instead going forward.
> 
> I have had a deeper look in Tika and how to improve the metadata/XMP output of it.
> I saw that you have a bug for XMP already (TIKA-756), which I would probably use to submit any patches related to that.
> But I am currently unsure what the best approach would be to do the mapping to XMP and I would like to hear your opinion on it before starting any work.
> 
> Let me quickly summarize if I have understood the basic metadata concept correctly:
> 
> 1.       Each parser fills a Metadata map which is a simple key-value list where values can also be multi-values
> 
> 2.       Mostly the keys for the Metadata map are taken from fixed lists which are defined as interfaces in the Metadata class
> 
> 3.       Those keys are usually Property objects, where the Property class also serves as a static list which registers every property that is created in the Metadata interfaces. This Property class resembles the XMP data model to some extend but does not store e.g. any hierarchical information. And it leaves every client the choice to store property names with prefixes or not.
> 
> 4.       Any metadata outputter just iterates over the Metadata map and could query the Property list for additional information.
> 
> 5.       In case of the XMP outputter (XMPContentHandler) only those properties are outputted which are stored with a prefix in the Property list.
> 
> 
> I see two potential ways to improve the situation:
> 
> 
> 1.       Have a fixed mapping table for each mime type which would be used in XMPContentHandler to map from the Metadata map to the XMP data model. Such mapping tables would be pretty ugly as each parser produces different metadata maps and there is no consistent way to handle them. This option would be least invasive for other clients of Tika but would also be a real hack and would not really improve the metadata situation in Tika in general.
> 
> 2.       Try to improve the Key interface lists of Metadata class and adjust all parsers accordingly. This could be done by adding new keys with prefixes and keeping/deprecating the existing ones to not disturb existing clients. Similar to what is proposed for the DublinCore namespace in TIKA-859 and TIKA-842.
> This would be more invasive but would offer the opportunity to really improve the metadata situation. I already saw a couple of places in the code that clearly break existing standards. But there are also examples where mapping might have to be done to different properties at the same time: If you look at the mapping of GPS data from Exif, this is currently mapped to W3C vocabulary in Tika. But in XMP this mapping is defined otherwise [1] by CIPA (the EXIF standardization committee). So probably both mappings have to be supported.
> 
> I personally would prefer option two. What do you think?
> Looking forward to working with you guys.
> Regards
> Jörg
> 
> [1] http://www.cipa.jp/english/hyoujunka/kikaku/cipa_e_kikaku_list.html
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

RE: Metadata situation and XMP support in Tika

Posted by Joerg Ehrlich <je...@adobe.com>.

Hi Ray,

Using ExifTool as an external parser is a good idea.
Currently at Adobe we also use the XMPFiles C++ library in our Java projects to read/write metadata, although not as a Tika parser (yet). But that is one idea for the future.
And yes, we should definitely coordinate on the metadata enhancements :)

Jörg

-----Original Message-----
From: Ray Gauss II [mailto:ray.gauss@alfresco.com] 
Sent: Wednesday, April 11, 2012 00:04
To: dev@tika.apache.org
Subject: Re: Metadata situation and XMP support in Tika

Hi Jörg,

As you've seen from TIKA-859 and TIKA-842 I've had to deal with similar issues.

Those issues were needed by TIKA-774 which itself contains another mapping that converts the data output by ExifTool to the proper IPTC metadata defined in TIKA-842.

The code for the ExifTool parser is now at https://github.com/Alfresco/tika-exiftool, and that mapping specifically is at:

https://github.com/Alfresco/tika-exiftool/blob/master/src/main/java/org/apache/tika/parser/exiftool/ExiftoolTikaIptcMapper.java

I'm more than happy to coordinate with you on the XMP stuff going forward if you'd like.

Ray Gauss II
DAM Architect, Alfresco


Re: Metadata situation and XMP support in Tika

Posted by Ray Gauss II <ra...@alfresco.com>.

Hi Jörg,

As you've seen from TIKA-859 and TIKA-842, I've had to deal with similar issues.

Those issues were prerequisites for TIKA-774, which itself contains another mapping that converts the data output by ExifTool to the proper IPTC metadata defined in TIKA-842.

The code for the ExifTool parser is now at https://github.com/Alfresco/tika-exiftool, and that mapping specifically is at:

https://github.com/Alfresco/tika-exiftool/blob/master/src/main/java/org/apache/tika/parser/exiftool/ExiftoolTikaIptcMapper.java
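
To give a rough idea of what such a mapping involves, here is a minimal sketch in plain Java. It is not the code from ExiftoolTikaIptcMapper and does not use the Tika API; the class name PrefixedKeyMapper, the method map() and the raw key names ("Title", "GPSLatitude", "GPSLongitude", "Software") are made up for illustration. It assumes a fixed table from raw parser keys to prefixed property names, where one raw key may feed several targets (for example both a W3C geo name and an Exif name for GPS data). Keys without a table entry are dropped, and values are copied unchanged; a real mapping would presumably also have to convert value formats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PrefixedKeyMapper {

    // Hypothetical mapping table: raw parser key -> one or more prefixed target names.
    private static final Map<String, List<String>> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("Title", Arrays.asList("dc:title"));
        // One raw key can feed two vocabularies at the same time.
        MAPPING.put("GPSLatitude", Arrays.asList("geo:lat", "exif:GPSLatitude"));
        MAPPING.put("GPSLongitude", Arrays.asList("geo:long", "exif:GPSLongitude"));
    }

    // Returns a new multi-valued map keyed by prefixed names.
    // Raw keys without a table entry are skipped; values are copied as-is.
    public static Map<String, List<String>> map(Map<String, List<String>> raw) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : raw.entrySet()) {
            List<String> targets = MAPPING.get(entry.getKey());
            if (targets == null) {
                continue; // no prefixed equivalent known for this key
            }
            for (String target : targets) {
                out.computeIfAbsent(target, k -> new ArrayList<>())
                   .addAll(entry.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> raw = new LinkedHashMap<>();
        raw.put("Title", Arrays.asList("Sunset"));
        raw.put("GPSLatitude", Arrays.asList("52.5"));
        raw.put("Software", Arrays.asList("SomeTool")); // unmapped, will be dropped
        System.out.println(map(raw));
    }
}
```

The table-driven approach keeps the parser output untouched, which is why it is the less invasive option, but every parser with different raw keys would need its own table.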

I'm more than happy to coordinate with you on the XMP stuff going forward if you'd like.

Ray Gauss II
DAM Architect, Alfresco

On Apr 5, 2012, at 8:58 AM, Joerg Ehrlich wrote:

> Hi everyone,
> 
> I am an engineer in the XMP/Metadata team at Adobe and we would like to leverage Tika in current projects for metadata extraction (and mimetype detection).
> Our current systems primarily use the XMP data model to manage and interact with metadata.
> As far as I can see, the support for the XMP data model and also for standard metadata schemas/namespaces (like IPTC, Exif, etc.) in Tika is pretty suboptimal as of today.
> But instead of wrapping Tika in our own layers of code in our systems, we feel that it would be more useful to contribute to the project going forward.
> 
> I have had a closer look at Tika and at how to improve its metadata/XMP output.
> I saw that you have a bug for XMP already (TIKA-756), which I would probably use to submit any patches related to that.
> But I am currently unsure what the best approach would be to do the mapping to XMP and I would like to hear your opinion on it before starting any work.
> 
> Let me quickly summarize the basic metadata concept to check whether I have understood it correctly:
> 
> 1.       Each parser fills a Metadata map which is a simple key-value list where values can also be multi-values
> 
> 2.       Mostly the keys for the Metadata map are taken from fixed lists which are defined as interfaces in the Metadata class
> 
> 3.       Those keys are usually Property objects, where the Property class also serves as a static list which registers every property that is created in the Metadata interfaces. This Property class resembles the XMP data model to some extent but does not store, e.g., any hierarchical information. It also leaves every client the choice to store property names with or without prefixes.
> 
> 4.       Any metadata outputter just iterates over the Metadata map and could query the Property list for additional information.
> 
> 5.       In the case of the XMP outputter (XMPContentHandler), only those properties that are stored with a prefix in the Property list are output.
> 
> 
> I see two potential ways to improve the situation:
> 
> 
> 1.       Have a fixed mapping table for each mime type which would be used in XMPContentHandler to map from the Metadata map to the XMP data model. Such mapping tables would be pretty ugly as each parser produces different metadata maps and there is no consistent way to handle them. This option would be least invasive for other clients of Tika but would also be a real hack and would not really improve the metadata situation in Tika in general.
> 
> 2.       Try to improve the Key interface lists of Metadata class and adjust all parsers accordingly. This could be done by adding new keys with prefixes and keeping/deprecating the existing ones to not disturb existing clients. Similar to what is proposed for the DublinCore namespace in TIKA-859 and TIKA-842.
> This would be more invasive but would offer the opportunity to really improve the metadata situation. I already saw a couple of places in the code that clearly break existing standards. But there are also examples where one value might have to be mapped to several properties at the same time: the GPS data from Exif is currently mapped to the W3C vocabulary in Tika, but for XMP this mapping is defined differently [1] by CIPA (the EXIF standardization committee). So both mappings probably have to be supported.
> 
> I personally would prefer option two. What do you think?
> Looking forward to working with you guys.
> Regards
> Jörg
> 
> [1] http://www.cipa.jp/english/hyoujunka/kikaku/cipa_e_kikaku_list.html
>