Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2008/12/03 01:34:32 UTC

Normalize metadata to Dublin Core

Hi,

Currently Tika doesn't have any good guidelines on the semantics and
usage of metadata keys. Mostly we've just ended up with a few basic
keys like CONTENT_TYPE and a bunch of more or less inconsistently used
other keys. The result is that a client that currently wants to assign
any reasonable semantics to the extracted metadata needs to first
check the reported CONTENT_TYPE and use that to deduce the meanings of
the other available metadata keys based on documentation in [1].

This is not optimal. It should be up to the Tika parsers to interpret
the metadata available in the supported document types and map that as
well as possible to a single standard like Dublin Core. This way a
client only needs to know a single set of metadata semantics.

The parser can still make the raw underlying metadata available using
metadata keys that are specific to the actual metadata schema used in
the document type, but that should be considered an extra feature
beyond the normalized Dublin Core output.

One corollary of this is that we should replace the current HTTP-based
CONTENT_TYPE metadata key with the Dublin Core FORMAT.

WDYT?

[1] http://lucene.apache.org/tika/formats.html

BR,

Jukka Zitting

RE: Normalize metadata to Dublin Core

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Chris,

> > There is another way to map the string-only variant: QName.valueOf().
> > This static method parses a QName the same way that XSL parameters are
> > passed as strings to stylesheets. Non-namespaced QNames are generated
> > the same way as above, so the DC variant could also be generated from a
> > string alone: QName.valueOf("{http://purl.org/dc/elements/1.0}format").
> > The ns-prefix is uninteresting, because QNames do not need one outside
> > of an XML parsing context (the URI alone identifies the namespace).
> > This is why equals() and hashCode() only use the namespace and
> > localName.
> 
> The problem with what you are suggesting is that it implies that we know
> (in the String-only case) that there is an implicit ns of
> "http://purl.org/dc/elements/1.0" attached to format. In the String-only
> case, we aren't always guaranteed this (b/c we still want to support
> arbitrarily defined keys as well).

Yes, but in that case QName.valueOf() would simply generate a QName without
a namespace, which would be the correct behaviour.

For the pre-defined constants (e.g. Metadata.CONTENT_TYPE) we could simply
map them to QNames (which makes source code compatible with the new API,
but not binary compatible), or just redefine the constants as
"{namespace}name" strings, which keeps the old API completely unchanged
(but uses QName.valueOf() internally). The latter change would be 100%
backwards compatible, as long as there are no hard-coded string constants
in parsers (which would then appear without a namespace).
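
For illustration, a minimal sketch of the latter option (the class shape
and method names just follow the examples earlier in this thread, nothing
is decided):

import javax.xml.namespace.QName;

// Sketch only: the old String constant now carries the DC namespace in
// the "{namespaceURI}localPart" form that QName.valueOf() understands.
public class Metadata {

    public static final String CONTENT_TYPE =
        "{http://purl.org/dc/elements/1.0}format";

    // New QName-based entry point (storage details omitted).
    public void addMetadata(QName key, String value) {
        // ... store under (namespaceURI, localPart) ...
    }

    // Old String-based entry point is kept as-is; a bare key like
    // "format" parses to a QName without a namespace.
    public void addMetadata(String key, String value) {
        addMetadata(QName.valueOf(key), value);
    }
}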

> I will file a JIRA issue and prepare a patch to upgrade o.a.t.m.Metadata
> to use a Map<QName,List<String>> internally and to keep existing API
> compatibility. Then you can let me know if and how it meets your needs.

Fine!

> Also, as an FYI, my name is "Chris", not "Matt".

Sorry, it was early in the morning and I was still half asleep :-)

Uwe


Re: Normalize metadata to Dublin Core

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Uwe,

On 12/7/08 11:02 PM, "Uwe Schindler" <uw...@thetaphi.de> wrote:

> Hi Matt,
> ...
>
> There is another way to map the string-only variant: QName.valueOf(). This
> static method parses a QName the same way that XSL parameters are passed
> as strings to stylesheets. Non-namespaced QNames are generated the same
> way as above, so the DC variant could also be generated from a string
> alone: QName.valueOf("{http://purl.org/dc/elements/1.0}format"). The
> ns-prefix is uninteresting, because QNames do not need one outside of an
> XML parsing context (the URI alone identifies the namespace). This is why
> equals() and hashCode() only use the namespace and localName.

The problem with what you are suggesting is that it implies that we know (in
the String-only case) that there is an implicit ns of
"http://purl.org/dc/elements/1.0" attached to format. In the String-only
case, we aren't always guaranteed this (b/c we still want to support
arbitrarily defined keys as well).

I will file a JIRA issue and prepare a patch to upgrade o.a.t.m.Metadata to
use a Map<QName,List<String>> internally and to keep existing API
compatibility. Then you can let me know if and how it meets your needs.
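
For reference, a minimal sketch of the internal structure I have in mind
(hypothetical, ahead of the actual patch):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.namespace.QName;

// Sketch only: multi-valued metadata keyed by QName, with the existing
// String-based method layered on top for API compatibility.
public class Metadata {

    private final Map<QName, List<String>> data =
        new HashMap<QName, List<String>>();

    public void addMetadata(QName key, String value) {
        List<String> values = data.get(key);
        if (values == null) {
            values = new ArrayList<String>();
            data.put(key, values);
        }
        values.add(value);
    }

    // Backwards-compatible String variant; a bare name maps to a QName
    // without a namespace.
    public void addMetadata(String key, String value) {
        addMetadata(QName.valueOf(key), value);
    }
}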

Also, as an FYI, my name is "Chris", not "Matt".

Thanks!

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



RE: Normalize metadata to Dublin Core

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Matt,

> QNames as met keys sounds like a very interesting proposal, provided we
> still allow for the simplistic met API of simply putting String keys as
> well. So, we shouldn't require that QNames are the only mechanism for
> adding new met keys -- we should support them -- but in addition we
> should also support (for backwards compatibility and simplicity/ease of
> use) allowing metadata to be added using String keys as well. So, this
> should still work:
> 
> Metadata met = new Metadata();
> met.addMetadata("format", "val1");
> met.addMetadata("format", "val2");
> 
> And, this should work:
> 
> Metadata met = new Metadata();
> met.addMetadata(new QName("http://purl.org/dc/elements/1.0",
>     "format", "dc"), "val1");
> met.addMetadata(new QName("http://purl.org/dc/elements/1.0",
>     "format", "dc"), "val2");
> 
> To support the simple case, we could have the method:
> 
> Metadata#addMetadata(String,String)
> 
> Simply be a wrapper around
> 
> Metadata#addMetadata(QName,String)
> 
> like:
> 
> public void addMetadata(String key, String val){
>   this.addMetadata(new QName(key), val);
> }
> 

There is another way to map the string-only variant: QName.valueOf(). This
static method parses a QName the same way that XSL parameters are passed
as strings to stylesheets. Non-namespaced QNames are generated the same
way as above, so the DC variant could also be generated from a string
alone: QName.valueOf("{http://purl.org/dc/elements/1.0}format"). The
ns-prefix is uninteresting, because QNames do not need one outside of an
XML parsing context (the URI alone identifies the namespace). This is why
equals() and hashCode() only use the namespace and localName.
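
A quick demonstration of the equivalence (self-contained;
javax.xml.namespace.QName ships with Java 5):

import javax.xml.namespace.QName;

public class QNameDemo {
    public static void main(String[] args) {
        // Parsed from the "{namespaceURI}localPart" form...
        QName parsed =
            QName.valueOf("{http://purl.org/dc/elements/1.0}format");
        // ...and constructed explicitly with a "dc" prefix.
        QName constructed =
            new QName("http://purl.org/dc/elements/1.0", "format", "dc");

        // equals() and hashCode() ignore the prefix, so both are equal
        // and interchangeable as map keys.
        System.out.println(parsed.equals(constructed)); // true

        // A bare name parses to a QName with no namespace.
        QName bare = QName.valueOf("format");
        System.out.println(bare.equals(parsed)); // false
    }
}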

Uwe


Re: Normalize metadata to Dublin Core

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Uwe,

QNames as met keys sounds like a very interesting proposal, provided we
still allow for the simplistic met API of simply putting String keys as
well. So, we shouldn't require that QNames are the only mechanism for
adding new met keys -- we should support them -- but in addition we should
also support (for backwards compatibility and simplicity/ease of use)
allowing metadata to be added using String keys as well. So, this should
still work:

Metadata met = new Metadata();
met.addMetadata("format", "val1");
met.addMetadata("format", "val2");

And, this should work:

Metadata met = new Metadata();
met.addMetadata(new QName("http://purl.org/dc/elements/1.0",
    "format", "dc"), "val1");
met.addMetadata(new QName("http://purl.org/dc/elements/1.0",
    "format", "dc"), "val2");

To support the simple case, we could have the method:

Metadata#addMetadata(String,String)

Simply be a wrapper around

Metadata#addMetadata(QName,String)

like:

public void addMetadata(String key, String val){
  this.addMetadata(new QName(key), val);
}

WDYT?

At a later time, if we really find no one uses the String keys, we can
deprecate the methods and then remove them from the API...

Cheers,
Chris




On 12/3/08 1:20 AM, "Uwe Schindler" <us...@pangaea.de> wrote:

> Hi Jukka,
>
> I like this.
>
> For the implementation (I noted this also in the corresponding JIRA issue):
> How about using QNames as keys in the metadata map (e.g. Map<QName,
> String>)? For the standard metadata entries from Dublin Core that are
> "mandatory" for all parsers, like Title (the current constants like
> CONTENT_TYPE), we could simply redefine the constants as QNames with the
> DC namespace-URI and maybe a prefix [but the prefix is not used in QNames;
> it's only there for reference, and equals() and hashCode() do not use it.
> QNames are simply pairs of (URI, Name)]. This makes QNames very elegant.
> This would make most parsers automatically source-compatible. The parsers
> needing updates are the ones that use plain Strings as keys.
>
> -----
> UWE SCHINDLER
> Webserver/Middleware Development
> PANGAEA - Publishing Network for Geoscientific and Environmental Data
> MARUM - University of Bremen
> Room 2500, Leobener Str., D-28359 Bremen
> Tel.: +49 421 218 65595
> Fax:  +49 421 218 65505
> http://www.pangaea.de/
> E-mail: uschindler@pangaea.de
>
>> -----Original Message-----
>> From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
>> Sent: Wednesday, December 03, 2008 1:35 AM
>> To: tika-dev@lucene.apache.org
>> Subject: Normalize metadata to Dublin Core
>>
>> Hi,
>>
>> Currently Tika doesn't have any good guidelines on the semantics and
>> usage of metadata keys. Mostly we've just ended up with a few basic
>> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
>> other keys. The result is that a client that currently wants to assign
>> any reasonable semantics to the extracted metadata needs to first
>> check the reported CONTENT_TYPE and use that to deduce the meanings of
>> the other available metadata keys based on documentation in [1].
>>
>> This is not optimal. It should be up to the Tika parsers to interpret
>> the metadata available in the supported document types and map that as
>> well as possible to a single standard like Dublin Core. This way a
>> client only needs to know a single set of metadata semantics.
>>
>> The parser can still make the raw underlying metadata available using
>> metadata keys that are specific to the actual metadata schema used in
>> the document type, but that should be considered an extra feature
>> beyond the normalized Dublin Core output.
>>
>> One corollary of this is that we should replace the current HTTP-based
>> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
>>
>> WDYT?
>>
>> [1] http://lucene.apache.org/tika/formats.html
>>
>> BR,
>>
>> Jukka Zitting
>
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



RE: Normalize metadata to Dublin Core

Posted by Uwe Schindler <us...@pangaea.de>.
Hi Jukka,

I like this.

For the implementation (I noted this also in the corresponding JIRA issue):
How about using QNames as keys in the metadata map (e.g. Map<QName,
String>)? For the standard metadata entries from Dublin Core that are
"mandatory" for all parsers, like Title (the current constants like
CONTENT_TYPE), we could simply redefine the constants as QNames with the
DC namespace-URI and maybe a prefix [but the prefix is not used in QNames;
it's only there for reference, and equals() and hashCode() do not use it.
QNames are simply pairs of (URI, Name)]. This makes QNames very elegant.
This would make most parsers automatically source-compatible. The parsers
needing updates are the ones that use plain Strings as keys.
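
For example, a sketch of how such constants could look (the names and the
DC namespace URI here just follow this mail, nothing is decided):

import javax.xml.namespace.QName;

// Sketch only: "mandatory" metadata constants as QNames in the DC
// namespace. The "dc" prefix is cosmetic; equals() and hashCode()
// ignore it.
public final class DublinCore {

    private static final String DC_NS = "http://purl.org/dc/elements/1.0";

    public static final QName TITLE  = new QName(DC_NS, "title",  "dc");
    public static final QName FORMAT = new QName(DC_NS, "format", "dc");

    private DublinCore() {
    }
}

A parser would then call e.g. metadata.addMetadata(DublinCore.FORMAT,
"application/pdf") and stay source-compatible, as long as it uses the
constants and not plain Strings.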

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Publishing Network for Geoscientific and Environmental Data
MARUM - University of Bremen
Room 2500, Leobener Str., D-28359 Bremen
Tel.: +49 421 218 65595
Fax:  +49 421 218 65505
http://www.pangaea.de/
E-mail: uschindler@pangaea.de

> -----Original Message-----
> From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> Sent: Wednesday, December 03, 2008 1:35 AM
> To: tika-dev@lucene.apache.org
> Subject: Normalize metadata to Dublin Core
> 
> Hi,
> 
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].
> 
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.
> 
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.
> 
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
> 
> WDYT?
> 
> [1] http://lucene.apache.org/tika/formats.html
> 
> BR,
> 
> Jukka Zitting


Re: Normalize metadata to Dublin Core

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 3, 2008 at 9:37 AM, Stephane Bastian
<st...@gmail.com> wrote:
> While this certainly sounds like a very good idea, it will be difficult to
> settle on using solely a single metadata format in Tika. Dublin Core is
> one of several metadata formats available, and while it is certainly
> suitable for some documents (Word, Excel, OpenDocument and such), it's not
> a silver bullet. For instance, when it comes to images, audio and others,
> it is fairly limited and we've got almost no choice but to describe the
> metadata in another format than Dublin Core (for instance we could use
> something like this: http://www.metadataworkinggroup.com/pdf/mwg_guidance.pdf )

Using Dublin Core as the standard does not mean that we couldn't
_also_ use other more specific metadata schemas where appropriate.

The basic metadata use case is just knowing the type, name,
descriptive title, and perhaps the author of the document. This we can
do with Dublin Core for all documents where such basic metadata is
available, and my point is that a client that only ever cares about
such basic things shouldn't need to worry about different metadata
schemas for different types of documents.

Also, for things like images we should settle for some common image
metadata schema so that a client that only cares about basic image
things like resolution, depth, etc. doesn't need to have complex logic
to determine which metadata keys it should use to get to such
information.

> What is important for me though is that Tika parsers should never extract
> meta-data using a key that doesn't belong to a known format, as it makes
> it difficult to use the data.

It's IMHO fine to use such novel keys when there is no standard
metadata schema that covers such information.

BR,

Jukka Zitting

Re: Normalize metadata to Dublin Core

Posted by Stephane Bastian <st...@gmail.com>.
Hi Jukka,

my 2 cents on this:

While this certainly sounds like a very good idea, it will be difficult
to settle on using solely a single metadata format in Tika. Dublin Core
is one of several metadata formats available, and while it is certainly
suitable for some documents (Word, Excel, OpenDocument and such), it's
not a silver bullet. For instance, when it comes to images, audio and
others, it is fairly limited and we've got almost no choice but to
describe the metadata in another format than Dublin Core (for instance
we could use something like this:
http://www.metadataworkinggroup.com/pdf/mwg_guidance.pdf )

What is important for me though is that Tika parsers should never
extract meta-data using a key that doesn't belong to a known format, as
it makes it difficult to use the data.

BR,

Stephane Bastian

Jukka Zitting wrote:
> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].
>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.
>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.
>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
>
> WDYT?
>
> [1] http://lucene.apache.org/tika/formats.html
>
> BR,
>
> Jukka Zitting
>   


Re: Normalize metadata to Dublin Core

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Sun, Dec 7, 2008 at 11:22 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Wed, Dec 3, 2008 at 1:05 PM, Robert Burrell Donkin
> <ro...@gmail.com> wrote:
>> should be simple enough to support minimal subclassing eg
>> tika:content-type -> dc:format
>
> We could do that, but what's the use case?

i'm thinking mainly of automated or computer-assisted cases

> The primary use case I'm thinking of is having a clear set of metadata
> fields that I can easily map to specific fields in a search index. For
> this use case it doesn't really matter what metadata schema we use as
> long as it's clear enough and we are consistent in using it

yes

> (e.g. all
> dc:format values produced by Tika would be MIME types, all dates of a
> specific format, etc.).

let's assume that - when used with tika - dc:format is implicitly
subclassed as media-type (a MIME attribute of content-type). dc:format
is a well-used vocabulary. the difficult class of documents is those
that have both MIME type and dc:format meta-data, but where the two
are unequal.

> A secondary use case is being able to easily use those fields when
> integrating with external metadata-aware applications. Here I think
> Dublin Core is the best alternative as I believe it's the most widely
> used and best understood (relatively speaking) metadata schema there
> is.

DC is imprecise and so difficult to work with. the semantic web crowd
now seems to prefer more precise schemas, which are more suitable for
automated reasoners.

> Currently I don't see where using subclasses or alternative schemas
> would bring enough value to counter the added complexity, but I'd be
> happy to be proven wrong.

DC is not a rich vocabulary. taking a look at
http://lucene.apache.org/tika/formats.html for Microsoft's OLE 2
Compound Document format we have (i've tried to figure out some
mappings to dc in square brackets)

    * TITLE  Title [dc:title]
    * SUBJECT Subject  [dc:subject? dc:description? dc:abstract?]
    * AUTHOR Author  [dc:creator? -> dc:contributor?]
    * KEYWORDS Keywords -> [dc:subject]
    * COMMENTS Comments [?]
    * TEMPLATE Template [?]
    * LAST_SAVED_BY Last Saved By [?]
    * REVISION_NUMBER Revision Number [?]
    * LAST_PRINTED Last Printed [?]
    * LAST_SAVED Last Saved Time/Date [dc:date?]
    * PAGE_COUNT Number of Pages [?]
    * WORD_COUNT Number of Words [?]
    * CHARACTER_COUNT Number of Characters [?]
    * APPLICATION_NAME Name of Creating Application [dc:creator?]

it's probably possible to find reasonable DC mappings for some of the
rest. some look like concepts which aren't really covered. note also
that there are a number of exact meta-data attributes which may
reasonably be mapped onto more general dc attributes, eg dc:date (for
example, LAST_SAVED and LAST_PRINTED are both subclasses of dc:date).

the main use case i have in mind for synonyms is indexed searching. in
particular, being able to drill down into facets. consider searches on
a large heterogeneous body of documents. tika has been used to extract
the meta-data, which is then stored and indexed. my use case is
searching on date, so map to dc:date. this is a good search that
should pick up all manner of documents which had all manner of changes
on that date. when there are too many results for me to browse, i want
to be able to drill down into the subclasses which have members in the
documents retrieved. this allows both general searches (on top level
synonyms) and more precise ones (lower level classes).

for indexing, every time the index is run the synonyms would need to
be used to generate derived meta-data from the original set.
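
a minimal sketch of that derivation step (the mapping table and all
names are made up):

import java.util.HashMap;
import java.util.Map;

// sketch only: derive general dc synonyms from exact meta-data keys at
// indexing time, so a search on dc:date also picks up documents that
// only carried LAST_SAVED or LAST_PRINTED.
public class SynonymExpander {

    private static final Map<String, String> SUPERCLASS =
        new HashMap<String, String>();
    static {
        SUPERCLASS.put("LAST_SAVED", "dc:date");
        SUPERCLASS.put("LAST_PRINTED", "dc:date");
        SUPERCLASS.put("KEYWORDS", "dc:subject");
    }

    // returns the derived field name, or null if the key has no known
    // superclass; the exact key is still indexed as-is so drill-down
    // into subclasses keeps working.
    public static String generalize(String key) {
        return SUPERCLASS.get(key);
    }
}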

- robert

Re: Normalize metadata to Dublin Core

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 3, 2008 at 1:05 PM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> should be simple enough to support minimal subclassing eg
> tika:content-type -> dc:format

We could do that, but what's the use case?

The primary use case I'm thinking of is having a clear set of metadata
fields that I can easily map to specific fields in a search index. For
this use case it doesn't really matter what metadata schema we use as
long as it's clear enough and we are consistent in using it (e.g. all
dc:format values produced by Tika would be MIME types, all dates of a
specific format, etc.).

A secondary use case is being able to easily use those fields when
integrating with external metadata-aware applications. Here I think
Dublin Core is the best alternative as I believe it's the most widely
used and best understood (relatively speaking) metadata schema there
is.

Currently I don't see where using subclasses or alternative schemas
would bring enough value to counter the added complexity, but I'd be
happy to be proven wrong.

BR,

Jukka Zitting

Re: Normalize metadata to Dublin Core

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Wed, Dec 3, 2008 at 11:33 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Wed, Dec 3, 2008 at 9:32 AM, Robert Burrell Donkin
> <ro...@gmail.com> wrote:
>> there are lots of good ways in which CONTENT_TYPE could be represented, eg
>> http://www.w3.org/Protocols/rfc2616/rfc2616.html#content-type or
>> http://dbpedia.org/page/Content-Type or
>> http://dublincore.org/2008/01/14/dcelements.rdf#format. the most
>> precise meaning is http://lucene.apache.org/tika/content_type. the
>> rest are just synonyms, and some more subjective than others.
>> different users may prefer different choices.
>
> Yeah, been there done that. :-) Getting your head around all the
> semantic details of different metadata schemas and making your content
> consistently use one of them is major work, and I'd rather do as much
> of that in Tika as possible so I won't need to reimplement it in each
> client application.
>
> My proposal is that we choose one widely used metadata schema as the
> standard in Tika and try to use it as consistently as possible in all
> our parsers. Even with its limitations, Dublin Core seems like the
> best alternative for us to use.

one of the problems i found with dublin core is that it's important to
adhere to definitions that are often quite specific and suited mostly
to librarians, yet also wide enough to be hard to interpret.
content-type has a good definition in HTTP, but dc:format could
contain just about anything. this makes it hard to parse. content-type
is a subclass of format.

DC is also limited in its expressiveness: simile and other people tend
to prefer DBpedia, which is wider and allows more precision.

>> this suggests - to me at least - that some minimal support would be
>> useful for deductive ontologies. (in the same way, the namespacing
>> gives minimal support for RDF.) for example, a user may ask for
>> http://dublincore.org/2008/01/14/dcelements.rdf#format but this
>> meta-data property may be absent but
>> http://lucene.apache.org/tika/content_type is present, and is a
>> subclass of http://dublincore.org/2008/01/14/dcelements.rdf#format .
>> so, that value is returned.
>
> There be dragons down that path...

yep :-)

<flame-proof-boots>
should be simple enough to support minimal subclassing eg
tika:content-type -> dc:format

i found that (in RAT) coding information like this in java turned out
to be a bad idea. probably a text configuration would be better with a
canonical version shipped with the software. people can then easily
contribute new mappings back.
</flame-proof-boots>
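
for illustration, such a text configuration might look like this (file
name and syntax invented):

# tika-synonyms.conf: subclass -> superclass, one mapping per line
tika:content-type   dc:format
tika:last-saved     dc:date
tika:last-printed   dc:date

people could contribute patches against a file like this without
touching any java.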

- robert

Re: Normalize metadata to Dublin Core

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 3, 2008 at 9:32 AM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> there are lots of good ways in which CONTENT_TYPE could be represented, eg
> http://www.w3.org/Protocols/rfc2616/rfc2616.html#content-type or
> http://dbpedia.org/page/Content-Type or
> http://dublincore.org/2008/01/14/dcelements.rdf#format. the most
> precise meaning is http://lucene.apache.org/tika/content_type. the
> rest are just synonyms, and some more subjective than others.
> different users may prefer different choices.

Yeah, been there done that. :-) Getting your head around all the
semantic details of different metadata schemas and making your content
consistently use one of them is major work, and I'd rather do as much
of that in Tika as possible so I won't need to reimplement it in each
client application.

My proposal is that we choose one widely used metadata schema as the
standard in Tika and try to use it as consistently as possible in all
our parsers. Even with its limitations, Dublin Core seems like the
best alternative for us to use.

> this suggests - to me at least - that some minimal support would be
> useful for deductive ontologies. (in the same way, the namespacing
> gives minimal support for RDF.) for example, a user may ask for
> http://dublincore.org/2008/01/14/dcelements.rdf#format but this
> meta-data property may be absent but
> http://lucene.apache.org/tika/content_type is present, and is a
> subclass of http://dublincore.org/2008/01/14/dcelements.rdf#format .
> so, that value is returned.

There be dragons down that path...

BR,

Jukka Zitting

Re: Normalize metadata to Dublin Core

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Wed, Dec 3, 2008 at 12:34 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].
>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.
>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.
>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
>
> WDYT?

like the idea :-)

but it gets more interesting once you move away from the basics

there are lots of good ways in which CONTENT_TYPE could be represented, eg
http://www.w3.org/Protocols/rfc2616/rfc2616.html#content-type or
http://dbpedia.org/page/Content-Type or
http://dublincore.org/2008/01/14/dcelements.rdf#format. the most
precise meaning is http://lucene.apache.org/tika/content_type. the
rest are just synonyms, and some more subjective than others.
different users may prefer different choices.

this suggests - to me at least - that some minimal support would be
useful for deductive ontologies. (in the same way, the namespacing
gives minimal support for RDF.) for example, a user may ask for
http://dublincore.org/2008/01/14/dcelements.rdf#format but this
meta-data property may be absent but
http://lucene.apache.org/tika/content_type is present, and is a
subclass of http://dublincore.org/2008/01/14/dcelements.rdf#format .
so, that value is returned.

- robert

Re: Normalize metadata to Dublin Core

Posted by "Mattmann, Chris A" <ch...@jpl.nasa.gov>.
Hi Jukka,

On 12/2/08 4:34 PM, "Jukka Zitting" <ju...@gmail.com> wrote:

> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].

This is really only true of sub-classes of o.a.t.parser.CompositeParser.
There is no enforcing mechanism that this be the case. In fact, on the
contrary, it's possible to write another o.a.t.parser.Parser implementation
that has entirely different semantics. Sure, by doing this you really don't
take advantage of tika-config.xml and its associated auto-goodness, but
that's the whole point of making Parser an interface: to allow folks to
adhere to the lowest common denominator standard.
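
For example, here's a minimal sketch of such a Parser (assuming the
current parse(InputStream, ContentHandler, Metadata) signature; the class
name and met keys are made up):

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Sketch only: a standalone Parser that ignores tika-config.xml and
// assigns metadata keys from its own private vocabulary.
public class CustomSemanticsParser implements Parser {

    public void parse(InputStream stream, ContentHandler handler,
            Metadata metadata)
            throws IOException, SAXException, TikaException {
        // Entirely different semantics: a made-up key, not one of the
        // documented per-format ones.
        metadata.set("myapp:kind", "opaque-binary");
        // ... emit extracted text through the handler as usual ...
    }
}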

>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.

I'm not sure of the relationship between the fact that CompositeParsers use
the CONTENT_TYPE metadata to determine which underlying parser to call, and
the metadata standard adhered to by the underlying parser.

It seems like you are suggesting that o.a.t.parser.Parsers should declare
what met semantics and std vocabulary (or vocabularies) they adhere to, so
as to know, e.g., if you can pipeline together different parsers, and take
advantage of their output.

This is an interesting proposition because if we go down this path, we are
now starting to get into the realm of data flow dependencies, and then we
have to start thinking about parsing workflows and how Tika can support
them. I think declaring things like required InputMetadata, and declaring
provided output met semantics and vocabularies would be a very interesting
and useful contribution to Tika.

However, I want to point out, that it seems that this is entirely
independent of the CompositeParser.

>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.

To me, while adhering to Dublin Core is great and provides standardization,
we shouldn't enforce Dublin Core as the _only_ output met vocabulary. In
fact, we should, as noted above, have several output met vocabularies and
perhaps have all the o.a.t.parser.Parsers declare them.

>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.

In what context? Could you be more specific?

Thanks,
 Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.