Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2015/11/03 19:02:18 UTC

Re: ISO 19115 as a metadata model for Tika?

I think some specific patches showing how this would look
would help move this out of the abstract and into the
concrete. I encourage you to try it out, MartinD, and see
if there is a good overlap there.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Martin Desruisseaux <ma...@geomatys.com>
Organization: Geomatys
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Tuesday, October 13, 2015 at 1:34 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: ISO 19115 as a metadata model for Tika?

>On 12/10/15 14:22, Nick Burch wrote:
>> Currently, it's very easy for a new user of Tika to get the metadata
>> they want out, they can just fetch a simple string value to get
>> started with. You can, when you learn more, start getting more richly
>> typed values out, but the quickstart is simple. Some libraries make it
>> so that you have to learn the full rich metadata structure right from
>> the get-go, which causes problems for new users. Whatever we do to
>> help the power users, we need to not ruin it for the beginners!
>
>What would be the approach for more richly typed values? Would they be
>an extension of the current model, or a second model existing in
>parallel with the first one?
>
>
>> For the discussion on "what should a richer Tika metadata system be
>> based on", I think TIKA-1607 is where that is taking place, plus some
>> related threads on-list.
>
>Thanks for the link. TIKA-1607 seems to be about associating arbitrary
>java.lang.Object values with property keys. But isn't that a little
>opaque? I mean, if a user gets an instance of a class they do not know,
>how do they extract information from it?
>
>
>> In the short term, if there are some key parts of that standard for
>> geospatial metadata that we don't currently handle, and could do
>> easily with the current setup, then we should raise a JIRA + get a
>> sample file + add the support
>
>Regarding ISO 19115 support, the main question seems to me to be how
>to handle a tree structure. The current Tika metadata structure seems to
>be like a Map<String,String[]> (please correct me if I'm wrong), while
>ISO 19115 is more like a Map<String,Node> where each Node can contain
>child nodes, thus forming a tree. The following example in Tika:
>
>    Creator…………………… Jon Smith
>    Publisher……………… A company
>    Title………………………… Anything
>
>would be in the ISO 19115 model (note how the creator and publisher are
>grouped under the same "responsible party" node):
>
>    Citation
>     ├─Title………………………………………………… Anything
>     └─Cited responsible party
>       [1]
>        ├─Role…………………………………………… Author
>        └─Individual
>           └─Name…………………………………… Jon Smith
>       [2]
>        ├─Role…………………………………………… Publisher
>        └─Organisation
>           └─Name…………………………………… A company
>
>The tree structure makes it possible to add other information, like email
>addresses and phone numbers, without confusion about whether an address
>applies to the creator or to the publisher. Of course a flat structure
>could prefix property names (e.g. "creator_address", "publisher_address",
>etc.), but this would result in a lot of keys. For example, ISO 19115
>defines 20 standard roles (resourceProvider, custodian, owner, user,
>distributor, originator, pointOfContact, principalInvestigator,
>processor, publisher, author, sponsor, coAuthor, collaborator, editor,
>mediator, rightsHolder, contributor, funder, stakeholder) and each of
>them can be associated with about 30 properties under the "Cited
>responsible party" node (name, positionName, phone, city,
>administrativeArea, postalCode, country, hoursOfService,
>contactInstruction, onlineResource, etc.). Would Tika want to
>handle that amount of data, and if so, is a flat structure really
>appropriate?
>
>    Martin
>
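
The flat-versus-tree contrast in Martin's message above can be sketched
in plain Java. This is only an illustration under stated assumptions: the
Node class, the lower-case key names and the prefixedKeyCount helper are
hypothetical and are not part of Tika or of any ISO 19115 library. The
flat map follows the Map<String,String[]> shape Martin guesses at, and
the tree mirrors his Citation example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataShapes {
    /** Hypothetical tree node: a label, an optional value, and child nodes. */
    public static final class Node {
        final String label;
        final String value;                       // null for pure grouping nodes
        final List<Node> children = new ArrayList<>();

        public Node(String label, String value) {
            this.label = label;
            this.value = value;
        }

        public Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Flat model: one key per property, each mapped to an array of strings. */
    public static Map<String, String[]> flat() {
        Map<String, String[]> m = new LinkedHashMap<>();
        m.put("creator", new String[] {"Jon Smith"});
        m.put("publisher", new String[] {"A company"});
        m.put("title", new String[] {"Anything"});
        return m;
    }

    /** Tree model: creator and publisher are both "Cited responsible party" nodes. */
    public static Node tree() {
        Node author = new Node("Cited responsible party", null)
                .add(new Node("Role", "Author"))
                .add(new Node("Individual", null).add(new Node("Name", "Jon Smith")));
        Node publisher = new Node("Cited responsible party", null)
                .add(new Node("Role", "Publisher"))
                .add(new Node("Organisation", null).add(new Node("Name", "A company")));
        return new Node("Citation", null)
                .add(new Node("Title", "Anything"))
                .add(author)
                .add(publisher);
    }

    /** Counts the nodes that carry a value (the leaves of the tree). */
    public static int countLeaves(Node node) {
        int count = (node.value != null) ? 1 : 0;
        for (Node child : node.children) {
            count += countLeaves(child);
        }
        return count;
    }

    /** Keys a flat model needs if every role prefixes every property. */
    public static int prefixedKeyCount(int roles, int propertiesPerRole) {
        return roles * propertiesPerRole;
    }
}
```

With the figures quoted in the message (20 roles, about 30 properties
each), prefixedKeyCount(20, 30) gives 600 prefixed keys for responsible
parties alone, whereas the tree reuses the same handful of child labels
under each party node.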


Re: ISO 19115 as a metadata model for Tika?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Martin. I’ll contact Gautham, who did the original
ISO 19115 parser, and see if he has time to take a look.

-----Original Message-----
From: Martin Desruisseaux <ma...@geomatys.com>
Organization: Geomatys
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, November 4, 2015 at 11:33 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: ISO 19115 as a metadata model for Tika?

>Hello Chris
>
>On 03/11/15 19:02, Mattmann, Chris A (3980) wrote:
>> I think some specific patches showing how this would look
>> would help move this out of the abstract and into the
>> concrete. I encourage you to try it out, MartinD, and see
>> if there is a good overlap there.
>
>I attached to TIKA-443 a demo that extracts some
>org.apache.tika.metadata.DublinCore properties from an
>org.opengis.metadata.Metadata object. It is not a patch that can be
>included in Tika, however, since I do not know how to integrate those
>properties in Tika (I would leave that work to volunteers).
>
>This demo tries to give some tips about only one aspect of the
>discussion: adding an ISO 19115 parser to Tika. Another aspect
>of the discussion is not covered by this demo: whether the Tika
>metadata model should be extended to support the richness of more
>complex models like ISO 19115.
>
>More specifically, if one looks at the demo, one can see that there are
>many loops. An "Identification" object can contain many "Citation"
>objects, which in turn can contain many "ResponsibleParty" objects, etc.
>For this demo I just mapped e.g. the title of the first "Identification"
>instance to the DublinCore "title" property, then broke out of the loop.
>Obviously information is lost, so the question is whether it is a goal
>for Tika to capture that information, or whether it is considered too
>specific.
>
>If Tika chooses to capture such information, then a tree structure will
>become necessary. The next question would then be how to do that, whether
>a "tree structure" and a "flat structure" should coexist, etc. But we do
>not need to answer those questions now (a simple ISO 19115 parser mapping
>to the current Dublin Core properties could be done).
>
>    Martin
>
>
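
The "take the first title, then leave the loop" mapping that Martin
describes for his demo can be sketched roughly as follows. This is not
the code attached to TIKA-443. The Citation and Identification classes
here are hypothetical stand-ins rather than the GeoAPI types, and the
plain "title" key is an assumed placeholder for whichever Dublin Core
property Tika would actually use.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FirstTitleMapping {
    /** Hypothetical stand-in for an ISO 19115 citation. */
    public static final class Citation {
        final String title;

        public Citation(String title) {
            this.title = title;
        }
    }

    /** Hypothetical stand-in for an identification holding several citations. */
    public static final class Identification {
        final List<Citation> citations;

        public Identification(List<Citation> citations) {
            this.citations = citations;
        }
    }

    /** Copies only the first non-null title into a flat map. */
    public static Map<String, String> toFlat(List<Identification> identifications) {
        Map<String, String> flat = new LinkedHashMap<>();
        outer:
        for (Identification identification : identifications) {
            for (Citation citation : identification.citations) {
                if (citation.title != null) {
                    flat.put("title", citation.title);
                    break outer;   // every later title is dropped here
                }
            }
        }
        return flat;
    }
}
```

Given two identifications whose citations are titled "First", "Second"
and "Third", toFlat keeps only "First". The two dropped titles are the
kind of information loss the message refers to.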

