Posted to dev@tika.apache.org by "Jukka Zitting (Created) (JIRA)" <ji...@apache.org> on 2011/10/21 13:04:32 UTC

[jira] [Created] (TIKA-759) Better handling of content type metadata

Better handling of content type metadata
----------------------------------------

                 Key: TIKA-759
                 URL: https://issues.apache.org/jira/browse/TIKA-759
             Project: Tika
          Issue Type: Improvement
          Components: metadata, mime
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting
            Priority: Minor


Currently we use the "Content-Type" metadata key for storing (and looking up) the media type of a document. This is simple and works well, especially with HTTP, but it is not well aligned with XMP or other metadata standards like Dublin Core. As an improvement, I propose the following:

* Switch to "dc:format" as the standard metadata key for the content type
* Keep the existing "Content-Type" key for backwards compatibility with existing clients
* Make the Metadata class aware of such aliases
* Add getFormat() and setFormat() utility methods to Metadata to simplify client code and to make the exact metadata key more of an implementation detail in Tika
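
To make the proposal concrete, here is a minimal sketch of how an alias-aware metadata container could look. It is not Tika's actual Metadata class: the class name AliasAwareMetadata, the ALIASES map, and the canonical() helper are all hypothetical, and a single-valued map is assumed for brevity. Only the two key names and the getFormat()/setFormat() method names come from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an alias-aware metadata store; not Tika's real API. */
class AliasAwareMetadata {
    /** Proposed standard key for the media type. */
    static final String FORMAT = "dc:format";
    /** Legacy key kept for backwards compatibility. */
    static final String CONTENT_TYPE = "Content-Type";

    /** Maps each alias to its canonical key (assumed structure). */
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put(CONTENT_TYPE, FORMAT);
    }

    /** Values are stored under canonical keys only. */
    private final Map<String, String> values = new HashMap<>();

    /** Resolves an alias to its canonical key; other keys pass through unchanged. */
    private static String canonical(String key) {
        return ALIASES.getOrDefault(key, key);
    }

    public String get(String key) {
        return values.get(canonical(key));
    }

    public void set(String key, String value) {
        values.put(canonical(key), value);
    }

    /** Convenience accessor so clients need not know the exact key. */
    public String getFormat() {
        return get(FORMAT);
    }

    public void setFormat(String mediaType) {
        set(FORMAT, mediaType);
    }
}
```

With this shape, an existing client that calls set("Content-Type", ...) and a new client that calls getFormat() would see the same value, which is the backwards-compatibility behaviour the list above asks for.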

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-759) Better handling of content type metadata

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132808#comment-13132808 ] 

Chris A. Mattmann commented on TIKA-759:
----------------------------------------

+1 to this Jukka!

In OODT-ville, for many years we've had something called a "Profile", see:

http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/Profile.java

A Profile is a metadata description of a resource with 3 different sets of attributes:

* housekeeping information about the Profile (its ID, created time, etc.)
* information about the data that the Profile points to (this is the Dublin Core set of information + some mods, and is housed in the http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/ResourceAttributes.java file)
* domain-specific metadata, which we keep as a set of ProfileElements (housed in http://svn.apache.org/repos/asf/oodt/trunk/profile/src/main/java/org/apache/oodt/profile/ProfileElement.java and its sub-classes, RangedProfileElement.java and EnumeratedProfileElement.java). ProfileElements correspond to ISO-11179 style elements, with information such as valid values, ranges, min/max, etc.

Not saying we should adopt the above. Our OODT stuff is bloated in some areas, and could be reduced, but just thought I'd pass it along for some inspiration! :-)
                
