Posted to dev@tika.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/11/24 23:47:44 UTC

Metadata Namespaces

What do people think of adding some type of namespace to the Metadata
attributes (Dublin, CC, etc.)?  I think this would allow us to
discern where the metadata came from.  For instance, in Solr, see
https://issues.apache.org/jira/browse/SOLR-284?focusedCommentId=12650353#action_12650353

I can see doing this a few different ways:

1. Allow the user to pass in a String that gets prefixed to all  
metadata names, with a constructor like:
  public Metadata(String namespace) {
    metadata = new HashMap<String, String[]>();
    this.namespace = namespace;
  }

and then anytime a key is needed, the namespace string is potentially  
prepended

2. Prefix the "core" attributes with "tika."

3. Prefix each sub-attribute appropriately, such as "dc.format" for  
the DublinCore Format attribute.

4. Combine 2 and 3.  We could try something a bit more involved and
formally define names like tika.dc.format, so that I could tell that
this attribute is core to Tika, comes from Dublin Core, and is named
Format.  Thus, if Solr adds its own parser that for whatever reason
isn't contributed back to Tika (just an example, I don't have anything
in mind), I could name its metadata attributes solr.foo.bar or however
I want to do it.

The default, I believe, should still be to have no namespace, i.e. the  
empty string namespace.
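
To make option 1 concrete, here is a rough sketch of how the prefixing
could work.  The class name NamespacedMetadata, the qualify() helper and
the "." separator are assumptions for illustration only; this is not the
existing Tika Metadata class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of option 1; not the actual Tika Metadata class.
class NamespacedMetadata {
    private final Map<String, String[]> metadata = new HashMap<>();
    private final String namespace;

    // Default: the empty-string namespace, so keys are stored unchanged.
    NamespacedMetadata() {
        this("");
    }

    NamespacedMetadata(String namespace) {
        this.namespace = (namespace == null) ? "" : namespace;
    }

    // Prepend the namespace (if any) to a bare attribute name.
    // Joining with "." is an assumption of this sketch.
    String qualify(String name) {
        return namespace.isEmpty() ? name : namespace + "." + name;
    }

    void set(String name, String value) {
        metadata.put(qualify(name), new String[] { value });
    }

    // Returns the first value stored under the name, or null if absent.
    String get(String name) {
        String[] values = metadata.get(qualify(name));
        return (values == null || values.length == 0) ? null : values[0];
    }

    public static void main(String[] args) {
        NamespacedMetadata m = new NamespacedMetadata("tika.dc");
        m.set("format", "application/pdf");
        System.out.println(m.qualify("format") + " = " + m.get("format"));
    }
}
```

With the no-argument constructor the keys stay exactly as they are
today, which matches the "no namespace by default" behavior.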

-Grant

RE: Metadata Namespaces

Posted by Uwe Schindler <uw...@thetaphi.de>.
How about javax.xml.namespace.QName as keys?
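
Roughly what I have in mind, as a sketch only: the class
QNameMetadataSketch and the DC_FORMAT constant are made up for
illustration and are not part of Tika.  QName itself ships with the JDK,
and its equals()/hashCode() use only the namespace URI and local part,
so the "dc" prefix is purely cosmetic.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;

// Hypothetical sketch: metadata keyed by QName instead of String.
class QNameMetadataSketch {
    // Namespace URI of the Dublin Core element set.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Illustrative key constant: namespace URI, local part, prefix.
    static final QName DC_FORMAT = new QName(DC_NS, "format", "dc");

    private final Map<QName, String[]> metadata = new HashMap<>();

    void set(QName key, String value) {
        metadata.put(key, new String[] { value });
    }

    // Returns the first value stored under the key, or null if absent.
    String get(QName key) {
        String[] values = metadata.get(key);
        return (values == null || values.length == 0) ? null : values[0];
    }

    public static void main(String[] args) {
        QNameMetadataSketch m = new QNameMetadataSketch();
        m.set(DC_FORMAT, "application/pdf");
        // Lookup with a prefix-less QName still matches the same entry.
        System.out.println(m.get(new QName(DC_NS, "format")));
        // The namespace tells the caller where the attribute came from.
        System.out.println(DC_FORMAT.getNamespaceURI());
    }
}
```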

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> Sent: Wednesday, November 26, 2008 2:18 AM
> To: tika-dev@lucene.apache.org
> Subject: Re: Metadata Namespaces
> 
> Hi,
> 
> On Tue, Nov 25, 2008 at 4:12 PM, Grant Ingersoll <gs...@apache.org> wrote:
> > That sounds reasonable, except I'm wary of RDF and all the
> > underpinnings required, but if we are just using the URI stuff then
> > maybe it won't be so bad.  One of the things I really like about Tika
> > is that it is lightweight.
> 
> Agreed. In Tika I'd only use the namespace URIs as plain strings with
> no extra machinery around them.
> 
> Another alternative that I like is to forget the idea of explicit
> namespaces and simply use URIs as the metadata keys. This would be
> perfectly in line with Dublin Core and would work reasonably well also
> with RDF clients. As an added benefit we could adopt this approach by
> simply modifying a few constants.
> 
> BR,
> 
> Jukka Zitting


Re: Metadata Namespaces

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Nov 25, 2008 at 4:12 PM, Grant Ingersoll <gs...@apache.org> wrote:
> That sounds reasonable, except I'm wary of RDF and all the underpinnings
> required, but if we are just using the URI stuff then maybe it won't be so
> bad.  One of the things I really like about Tika is that it is lightweight.

Agreed. In Tika I'd only use the namespace URIs as plain strings with
no extra machinery around them.

Another alternative that I like is to forget the idea of explicit
namespaces and simply use URIs as the metadata keys. This would be
perfectly in line with Dublin Core and would work reasonably well also
with RDF clients. As an added benefit we could adopt this approach by
simply modifying a few constants.
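
As a rough illustration of the "few constants" point, something along
these lines could work.  The class UriKeyedMetadataSketch, the constant
names and the namespaceOf() helper are assumptions made for this sketch;
the URIs are the standard Dublin Core element identifiers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain String keys whose values are full URIs.
class UriKeyedMetadataSketch {
    // Previously such a constant might simply have been "format" (assumed).
    static final String FORMAT = "http://purl.org/dc/elements/1.1/format";
    static final String TITLE = "http://purl.org/dc/elements/1.1/title";

    private final Map<String, String[]> metadata = new HashMap<>();

    void set(String key, String value) {
        metadata.put(key, new String[] { value });
    }

    // Returns the first value stored under the key, or null if absent.
    String get(String key) {
        String[] values = metadata.get(key);
        return (values == null || values.length == 0) ? null : values[0];
    }

    // Plain string handling, no RDF machinery: everything up to and
    // including the last '/' or '#' is treated as the namespace.
    static String namespaceOf(String key) {
        int i = Math.max(key.lastIndexOf('/'), key.lastIndexOf('#'));
        return (i < 0) ? "" : key.substring(0, i + 1);
    }

    public static void main(String[] args) {
        UriKeyedMetadataSketch m = new UriKeyedMetadataSketch();
        m.set(FORMAT, "application/pdf");
        System.out.println(namespaceOf(FORMAT) + " -> " + m.get(FORMAT));
    }
}
```

Code that already refers to the constants rather than to literal key
strings would not need to change.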

BR,

Jukka Zitting

Re: Metadata Namespaces

Posted by Grant Ingersoll <gs...@apache.org>.
That sounds reasonable, except I'm wary of RDF and all the
underpinnings required, but if we are just using the URI stuff then
maybe it won't be so bad.  One of the things I really like about Tika
is that it is lightweight.


On Nov 24, 2008, at 7:28 PM, Jukka Zitting wrote:

> Hi,
>
> On Mon, Nov 24, 2008 at 11:47 PM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>> What do people think of adding some type of namespace to the Metadata
>> attributes (Dublin, CC, etc.).   I think this would allow us to  
>> discern
>> where the metadata came from.
>
> Agreed. See TIKA-61 for an earlier proposal about this. I kind of
> prefer URIs over prefixes like "dc." as the namespacing mechanism, as
> that would make it easier to interoperate with things like XMP and
> RDF.
>
> In general I think there's still a lot of work waiting for us in
> metadata handling. It's good to see some real use cases coming up to
> help us in that work.
>
> BR,
>
> Jukka Zitting



Re: Metadata Namespaces

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Nov 24, 2008 at 11:47 PM, Grant Ingersoll <gs...@apache.org> wrote:
> What do people think of adding some type of namespace to the Metadata
> attributes (Dublin, CC, etc.).   I think this would allow us to discern
> where the metadata came from.

Agreed. See TIKA-61 for an earlier proposal about this. I kind of
prefer URIs over prefixes like "dc." as the namespacing mechanism, as
that would make it easier to interoperate with things like XMP and
RDF.

In general I think there's still a lot of work waiting for us in
metadata handling. It's good to see some real use cases coming up to
help us in that work.

BR,

Jukka Zitting