Posted to dev@tika.apache.org by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org> on 2011/10/21 18:38:32 UTC
[jira] [Commented] (TIKA-759) Better handling of content type metadata
[ https://issues.apache.org/jira/browse/TIKA-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132808#comment-13132808 ]
Chris A. Mattmann commented on TIKA-759:
----------------------------------------
+1 to this, Jukka!
In OODT-ville, for many years we've had something called a "Profile", see:
http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/Profile.java
A Profile is a metadata description of a resource with 3 different sets of attributes:
* housekeeping information about the Profile (its ID, created time, etc.)
* information about the data that the Profile points to (this is the Dublin Core set of information plus some modifications; see http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/ResourceAttributes.java)
* domain-specific metadata, which we keep as a set of ProfileElements (see http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/ProfileElement.java and its subclasses, RangedProfileElement.java and EnumeratedProfileElement.java). ProfileElements correspond to ISO-11179 style elements and carry information such as valid values, ranges, and min/max.
Not saying we should adopt the above. Our OODT stuff is bloated in some areas, and could be reduced, but just thought I'd pass it along for some inspiration! :-)
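To make the three-part structure above concrete, here is a minimal Java sketch. It is not the actual OODT code: every class name, field, and constructor below is hypothetical and heavily simplified, and the real Profile, ResourceAttributes, and ProfileElement classes linked above carry far more detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the three attribute sets described above.
// None of these classes mirror the real OODT signatures.

/** Housekeeping information about the profile itself. */
class ProfileAttributesSketch {
    final String id;
    final String createdTime;

    ProfileAttributesSketch(String id, String createdTime) {
        this.id = id;
        this.createdTime = createdTime;
    }
}

/** Dublin Core style description of the data the profile points to. */
class ResourceAttributesSketch {
    final String title;
    final String format;

    ResourceAttributesSketch(String title, String format) {
        this.title = title;
        this.format = format;
    }
}

/** Domain-specific element; each subclass decides which values are valid. */
abstract class ProfileElementSketch {
    final String name;

    ProfileElementSketch(String name) {
        this.name = name;
    }

    abstract boolean accepts(String value);
}

/** Element whose valid values are numbers within an inclusive min/max range. */
class RangedElementSketch extends ProfileElementSketch {
    final double min;
    final double max;

    RangedElementSketch(String name, double min, double max) {
        super(name);
        this.min = min;
        this.max = max;
    }

    @Override
    boolean accepts(String value) {
        try {
            double v = Double.parseDouble(value);
            return v >= min && v <= max;
        } catch (NumberFormatException e) {
            return false; // non-numeric text is never inside the range
        }
    }
}

/** Element whose valid values come from a fixed list. */
class EnumeratedElementSketch extends ProfileElementSketch {
    final List<String> validValues;

    EnumeratedElementSketch(String name, String... validValues) {
        super(name);
        this.validValues = Arrays.asList(validValues);
    }

    @Override
    boolean accepts(String value) {
        return validValues.contains(value);
    }
}

/** A profile bundles the three attribute sets together. */
class ProfileSketch {
    final ProfileAttributesSketch profileAttributes;
    final ResourceAttributesSketch resourceAttributes;
    final List<ProfileElementSketch> elements = new ArrayList<>();

    ProfileSketch(ProfileAttributesSketch profileAttributes,
                  ResourceAttributesSketch resourceAttributes) {
        this.profileAttributes = profileAttributes;
        this.resourceAttributes = resourceAttributes;
    }
}
```

In this sketch the content type would live in the resource attributes (the `format` field), while things like valid ranges stay with the domain-specific elements.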
> Better handling of content type metadata
> ----------------------------------------
>
> Key: TIKA-759
> URL: https://issues.apache.org/jira/browse/TIKA-759
> Project: Tika
> Issue Type: Improvement
> Components: metadata, mime
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
>
> Currently we use the "Content-Type" metadata key for storing (and looking up) the media type of a document. This is simple enough and works well, especially with HTTP, but it is not well aligned with XMP or other metadata standards like Dublin Core. So as an improvement I propose the following:
> * Switch to "dc:format" as the standard metadata key for the content type
> * Keep the existing "Content-Type" key for backwards compatibility with existing clients
> * Make the Metadata class aware of such aliases
> * Add getFormat() and setFormat() utility methods to Metadata to simplify client code and to make the exact metadata key more of an implementation detail in Tika
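The alias idea in the proposal above could look roughly like the following Java sketch. This is not Tika's Metadata class: `AliasAwareMetadata`, its alias table, and its single-valued `get`/`set` methods are assumptions made for illustration only. Only the key names "dc:format" and "Content-Type" and the method names getFormat() and setFormat() come from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an alias-aware metadata store; not the Tika API.
class AliasAwareMetadata {
    static final String FORMAT = "dc:format";          // proposed standard key
    static final String CONTENT_TYPE = "Content-Type"; // legacy key kept as an alias

    // Maps each alias to the canonical key it stands for.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put(CONTENT_TYPE, FORMAT);
    }

    // Simplification: one value per key.
    private final Map<String, String> values = new HashMap<>();

    // Resolve an alias to its canonical key; other keys pass through unchanged.
    private static String canonical(String key) {
        return ALIASES.getOrDefault(key, key);
    }

    String get(String key) {
        return values.get(canonical(key));
    }

    void set(String key, String value) {
        values.put(canonical(key), value);
    }

    // Utility methods so client code does not need to know the exact key.
    String getFormat() {
        return get(FORMAT);
    }

    void setFormat(String mediaType) {
        set(FORMAT, mediaType);
    }
}
```

With this shape, an existing client that calls `set("Content-Type", "text/html")` and a new client that calls `getFormat()` see the same value, which is the backwards-compatibility property the proposal asks for.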