Posted to dev@tika.apache.org by Martin Desruisseaux <ma...@geomatys.com> on 2015/10/12 12:27:50 UTC

ISO 19115 as a metadata model for Tika?

Hello all

At the last ApacheCon in Budapest, we had some discussion about
geospatial metadata in Tika. Currently Tika has 3 properties (latitude,
longitude, altitude) in its org.apache.tika.metadata.Geographic
interface, also reproduced in the TikaCoreProperties interface.
Geospatial metadata can be more complex, but does Tika wish to support
more geospatial metadata structures or to keep that model simple?

If Tika wishes to support geospatial metadata more extensively, would
Tika consider using the ISO 19115 metadata model? This international
standard is the official metadata model of the Open Geospatial
Consortium (OGC) and is in use in various organisations (some parts of
NASA, European Space Agency, Food and Agriculture Organisation, etc.).
The ISO 19115 standard is quite big, with about 500 properties.

ISO 19115 could be a format like any other format in Tika. One possible
way for Tika to read and write ISO 19115 documents in XML would be to
use Apache Spatial Information System. The Maven dependency would be:

    <dependency>
     <groupId>org.apache.sis.core</groupId>
     <artifactId>sis-metadata</artifactId>
     <version>0.6</version>
    </dependency>

And the code can be (there is a more generic API that also works with
NetCDF files, but we can leave that for later):

    import java.net.URL;
    import org.apache.sis.xml.XML;
    import org.opengis.metadata.Metadata;

    ...

    // "url" is a java.net.URL pointing to an ISO 19115 document in XML.
    Metadata metadata = (Metadata) XML.unmarshal(url);

The above Metadata object is the root of a tree. It may have many titles
for different things (a title for the data, a title for the quality
evaluation procedure, etc.), many authors, many variables in the
dataset, etc. One possible problem is that ISO 19115 metadata requires a
tree structure, while in my understanding Tika metadata is currently
stored in a flat structure. Does Tika plan to support a tree structure?
Would it be a prerequisite before Tika can support ISO 19115?
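
As a rough illustration of the flat-versus-tree distinction raised above,
here is a minimal sketch in plain Java with no Tika or SIS dependency. The
names (FlatVsTree, Node, the key strings) are hypothetical and only show the
two shapes; they are not Tika or ISO 19115 APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only: nothing here is part of Tika or Apache SIS. */
class FlatVsTree {
    /** A tree node: an optional value plus ordered children. */
    static final class Node {
        final String name;
        final String value;                      // null for pure container nodes
        final List<Node> children = new ArrayList<>();

        Node(String name, String value) {
            this.name = name;
            this.value = value;
        }

        /** Appends a child and returns this node, to allow chaining. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Flat model: every key maps directly to one or more strings. */
    static Map<String, String[]> flat() {
        Map<String, String[]> m = new LinkedHashMap<>();
        m.put("title", new String[] {"A title"});
        m.put("creator", new String[] {"First author", "Second author"});
        return m;
    }

    /** Tree model: the same facts, but each author is a node that can hold more details. */
    static Node tree() {
        return new Node("citation", null)
            .add(new Node("title", "A title"))
            .add(new Node("author", null).add(new Node("name", "First author")))
            .add(new Node("author", null).add(new Node("name", "Second author")));
    }

    /** Counts the nodes of a tree, root included. */
    static int count(Node n) {
        int total = 1;
        for (Node c : n.children) {
            total += count(c);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("flat keys: " + flat().keySet());
        System.out.println("tree nodes: " + count(tree()));
    }
}
```

In the tree form, an email address or phone number could be added under one
"author" node without ambiguity about which author it belongs to; the flat
form would need a new key per author and per detail.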

Another question is related to the fact that while officially a
geospatial metadata standard, ISO 19115 is actually a much more generic
metadata standard with some geospatial parts in it. In my understanding
ISO 19115 contains most of Dublin Core, together with many of the
properties currently provided in various Tika interfaces. ISO 19115
could potentially replace many org.apache.tika.metadata interfaces with
a single consistent model. I presume that such a replacement would not be
possible for compatibility reasons, and maybe also for complexity
reasons. But I would be curious to know whether Tika has a plan for the
evolution of its metadata model.

    Regards,

        Martin



Re: ISO 19115 as a metadata model for Tika?

Posted by Martin Desruisseaux <ma...@geomatys.com>.
Le 15/10/15 13:21, Nick Burch a écrit :
> Tika doesn't only use Dublin Core. Tika uses about half a dozen
> well-known externally defined metadata models (meta-metadata?). Dublin
> core is one of those, but certainly not the only one.

Yes, but in my understanding this is a juxtaposition of many models.
Some bigger (but admittedly more complex) standards like ISO 19115
provide a single consistent model for what is currently split across many
models in Tika. I'm not saying that Tika should change (it would be a
never-ending story since we could always find yet bigger models, and it
may not be possible to find a model that pleases every community).
I'm just trying to see how those bigger models could fit into the Tika picture.


> We rely on external definitions to explain what a metadata key
> represents, and the better known that definition the easier it is for
> our users. We then have the parsers map from their format-specific
> metadata onto the most appropriate well-known key.

Yes, the sis-metadata module works in the same way, except that it maps
only to OGC/ISO 191xx keys.


> Whatever we do, it needs to be easy for people to work out what they
> want, and what something means. If they have to read a many hundred
> page ISO standard to figure it out, we've failed!

Understood, this is where the question about multiple models comes in. In my
understanding, in some sense Tika currently provides a single model even
if it comes from multiple external definitions. For example if someone
wants a date from an XMP file, they need to use the Dublin core key rather
than the XMP key. But if someone is more familiar with ISO 19115 than
Dublin core, then the above approach could increase the complexity for
them because that user would need to know two models instead of one, and to
remember which ISO 19115 properties need to be accessed by the Dublin
key rather than the ISO key.

An alternative could be to allow the same property to be accessed by two
(or more) keys. Those keys would be defined by different standards
co-existing in Tika. Tika would not provide a model for each data
format, but only for a very small set of well recognized standards (e.g.
2 or 3). The Tika parsers would map their metadata to the keys of the
standard model most appropriate to them, and Tika would take care of the
equivalence between e.g. Dublin core and ISO 19115.
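
A minimal sketch of that idea, assuming a simple alias table: several keys
from different standards resolve to one stored value. The class name, the
key strings and the "first key is canonical" rule are all assumptions made
for illustration; this is not an existing Tika API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: key names and the alias table are assumptions, not Tika's API. */
class AliasedMetadata {
    /** Maps every accepted key (from any standard) to one canonical internal key. */
    private final Map<String, String> canonical = new HashMap<>();

    /** Values, stored once under the canonical key. */
    private final Map<String, String> values = new HashMap<>();

    /** Declares that all the given keys designate the same property; the first one is canonical. */
    void declareEquivalent(String... keys) {
        for (String k : keys) {
            canonical.put(k, keys[0]);
        }
    }

    /** Stores a value under whichever key the parser prefers. */
    void set(String key, String value) {
        values.put(canonical.getOrDefault(key, key), value);
    }

    /** Reads a value through any equivalent key, or returns null if none is set. */
    String get(String key) {
        return values.get(canonical.getOrDefault(key, key));
    }

    public static void main(String[] args) {
        AliasedMetadata md = new AliasedMetadata();
        md.declareEquivalent("dc:title", "iso19115:identificationInfo/citation/title");
        md.set("dc:title", "A title");                        // written with the Dublin core key
        System.out.println(md.get("iso19115:identificationInfo/citation/title"));
    }
}
```

A parser would write with the key of the standard closest to its format, and
a user would read with the key of the standard they know best.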

Would it make sense?

    Martin



Re: ISO 19115 as a metadata model for Tika?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 15 Oct 2015, Martin Desruisseaux wrote:
> So I'm not looking for a solution to a technical problem, but I'm trying
> to learn more about the strategic direction that Tika wishes to take.
> Would Tika consider moving to a richer metadata model than Dublin
> core?

Tika doesn't only use Dublin Core. Tika uses about half a dozen well-known 
externally defined metadata models (meta-metadata?). Dublin Core is one 
of those, but certainly not the only one.

We rely on external definitions to explain what a metadata key represents, 
and the better known that definition the easier it is for our users. We 
then have the parsers map from their format-specific metadata onto the 
most appropriate well-known key

(By most appropriate, one example is some of the XMP bits. XMP has its 
own date metadata, but we don't use those. We instead use the better known 
Dublin Core properties for the dates, and only media-specific parts of 
XMP)

> Would ISO 19115 be considered too geospatial-centric (which I could 
> understand)? Would Tika support more than one "universal model" if it 
> wants to preserve Dublin core simplicity with the richness of other 
> international standards?

As mentioned above, Tika already has multiple external definitions in use, 
but only one for each area

Whatever we do, it needs to be easy for people to work out what they want, 
and what something means. If they have to read a many hundred page ISO 
standard to figure it out, we've failed! Ditto if it becomes an epic 
battle to work out what a value is / how to decode it

Nick

Re: ISO 19115 as a metadata model for Tika?

Posted by Martin Desruisseaux <ma...@geomatys.com>.
Le 16/10/15 00:43, Nick Burch a écrit :
> Our current new-ish retrofitted model with properties (which offer
> both a simple string and richer typed values) covers most of those,
> but is struggling with the complex formats case.
>
> All the alternatives, including my own preferred (complexity on the
> key not value) have downsides and have issues on at least one of the
> above!
>
> I think we're all very keen to find out how other projects have
> tackled the same problem space, and how they've squared our circle...!

We tried to put the tree structure in keys (rather than values) in another
project, but I abandoned that path because of its complexity. Since many
ISO 19115 properties can contain an arbitrary number of values, we
had to define some selectors (e.g. "identificationInfo[1]/citation"
where "[1]" means "first element of the collection"). Then we have more
complex cases like "take the first date where date.type ==
DateType.CREATION" for fetching the creation date. We could use XPath,
but this is not very simple and cannot cover all the needs.
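
To give an idea of what such key-side selectors involve, here is a small
sketch that resolves a path like "identificationInfo[1]/citation/title"
against nested maps and lists. The class and method names are hypothetical,
and the nested-map representation is an assumption made to keep the example
self-contained; it is neither XPath nor the code of the project mentioned
above.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of key-side selectors; not an XPath engine and not a SIS or Tika API. */
class PathSelector {
    /**
     * Resolves a path such as "identificationInfo[1]/citation/title" against nested
     * maps and lists. Indices are 1-based, as in the selector syntax described above.
     * Returns null when a step does not match.
     */
    static Object resolve(Object root, String path) {
        Object current = root;
        for (String step : path.split("/")) {
            String name = step;
            int index = 0;                       // 0 means "no index given"
            int bracket = step.indexOf('[');
            if (bracket >= 0 && step.endsWith("]")) {
                name = step.substring(0, bracket);
                index = Integer.parseInt(step.substring(bracket + 1, step.length() - 1));
            }
            if (!(current instanceof Map)) {
                return null;                     // cannot look up a name in a non-map
            }
            current = ((Map<?, ?>) current).get(name);
            if (index > 0) {
                if (!(current instanceof List) || index > ((List<?>) current).size()) {
                    return null;                 // not a collection, or index out of range
                }
                current = ((List<?>) current).get(index - 1);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = Map.of(
            "identificationInfo", List.of(
                Map.of("citation", Map.of("title", "First title")),
                Map.of("citation", Map.of("title", "Second title"))));
        System.out.println(resolve(metadata, "identificationInfo[2]/citation/title"));
    }
}
```

Even this small resolver needs its own mini-syntax, and it still has no way
to express a condition such as "the first date whose type is CREATION",
which is the kind of case that made the approach hard to maintain.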

For now, I have not yet found a good alternative to simple keys and
values as Plain Old Java Objects. For those who need a more dynamic
approach, we provide views of each object as java.util.Map (using
reflection).
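
The general idea of exposing a plain object as a java.util.Map through
reflection can be sketched as follows. This is a simplified snapshot (it
copies getter results once), with a hypothetical Citation class; it is not
the SIS implementation and makes no claim about how SIS builds its views.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a snapshot (not a live view) of an object's getters as a Map. */
class BeanMapView {
    /** Example POJO; the property names are illustrative only. */
    public static final class Citation {
        public String getTitle() { return "A title"; }
        public String getEdition() { return null; }
    }

    /** Collects the non-null results of public, non-static, zero-argument getXxx() methods. */
    static Map<String, Object> asMap(Object bean) {
        Map<String, Object> map = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            String n = m.getName();
            if (n.startsWith("get") && n.length() > 3 && m.getParameterCount() == 0
                    && !Modifier.isStatic(m.getModifiers()) && !n.equals("getClass")) {
                Object value;
                try {
                    value = m.invoke(bean);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot read " + n, e);
                }
                if (value != null) {
                    // "getTitle" becomes the key "title"
                    String key = Character.toLowerCase(n.charAt(3)) + n.substring(4);
                    map.put(key, value);
                }
            }
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(asMap(new Citation()));
    }
}
```

A user who does not know the class can then iterate over the map entries
instead of calling typed getters.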

    Martin



RE: ISO 19115 as a metadata model for Tika?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 15 Oct 2015, Allison, Timothy B. wrote:
> Y, as I'm thinking more about c) (and note that this is a personal and 
> half-baked proposal, not at all speaking for the Tika community), we 
> could offer multiple models for advanced users.

One thing I'm keen to see is the ability to map on the output side back 
into other standards. Our XMP module is one such use of this. The JSON one 
is too, but less standard... Idea being that we have a common, sane, 
rich-enough internal representation, then people wanting XMP / ISO-19115 
/ etc can then transform the output Tika metadata onwards into their 
chosen format

> If someone wanted to contribute code that would represent metadata in 
> ISO 19115 for the appropriate parsers or if we could scrape ISO-19115 
> out of documents (as we might consider doing with XMP streams), the 
> advanced user could grab that node and go to town.  To emphasize Nick's 
> point, we absolutely want to keep the basics easy to get to.  No single 
> standard is likely to be sufficient for us, and yet, we also don't want 
> to create our very own.

There's also Giuseppe's work on input metadata, eg TIKA-1691, to allow 
richer mapping from input metadata onto our standards. Having helped give 
his talk in Budapest, with help from Michael from OODT/JPL, I get this 
better now. Idea is that quite custom formats (eg PDFs from one specific 
conference) could say "grab this text as 'first name', that as 'second 
name', in our own custom metadata standard, then combine the two for 
dc:creator for everyone else". That probably works best in combination 
with some of the content -> metadata content handlers.

My view is that we need:
  * A model simple enough for beginners to understand and get started
  * Something flexible enough for advanced users to still utilise
  * Something that's consistent, as much as possible, between formats
  * Something with enough information (possibly hidden by default) to allow
    richly mapping out into other standards/systems
  * Something that works for "simple" office docs, but still copes with
    "complex" ones like media and scientific formats, without too much
    surprise of changes
  * Something that deals with the conflicts in the above ;-)

Our current new-ish retrofitted model with properties (which offer both a 
simple string and richer typed values) covers most of those, but is 
struggling with the complex formats case.

All the alternatives, including my own preferred (complexity on the key 
not value) have downsides and have issues on at least one of the above!

I think we're all very keen to find out how other projects have tackled 
the same problem space, and how they've squared our circle...!

Nick

RE: ISO 19115 as a metadata model for Tika?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> So this email is for discussion only - not for immediate action.
Got it.  As you can see by TIKA-1607 and [0], this has been an ongoing and important discussion, and I appreciate your contributions...I'm not a standards person, and was interested to learn more about ISO 19115.

> But approach 1 or c suggests that different conceptual models (e.g. Dublin core versus ISO 19115) would co-exist.
Y, as I'm thinking more about c) (and note that this is a personal and half-baked proposal, not at all speaking for the Tika community), we could offer multiple models for advanced users.  If someone wanted to contribute code that would represent metadata in ISO 19115 for the appropriate parsers or if we could scrape ISO-19115 out of documents (as we might consider doing with XMP streams), the advanced user could grab that node and go to town.  To emphasize Nick's point, we absolutely want to keep the basics easy to get to.  No single standard is likely to be sufficient for us, and yet, we also don't want to create our very own.

Again, I can't emphasize enough the importance of Nick's point on keeping simple things simple.  As SOLR-7232 shows, even our current model is not being used correctly by very important consumers....I really need to get to work on that one...

Cheers,

              Tim


[0] http://wiki.apache.org/tika/MetadataRoadmap


Re: ISO 19115 as a metadata model for Tika?

Posted by Martin Desruisseaux <ma...@geomatys.com>.
Le 14/10/15 20:15, Allison, Timothy B. a écrit :
> On TIKA-1607, there are two (and a half) proposals:
> 1) move everything to DOM with helper classes for common elements
> 2) use POJOs as metadata values
> c) ;) keep current setup, perhaps add binary values, use DOM inputstreams for things that already have standards (e.g. Dublin core)  This could be a transitional step to option 1 in Tika 2.0.
>
> If we went with 1 or c), we could embed ISO 19115: we could either embed the info within the DOM or add an ISO DOM stream that would include this information.

Thanks for explaining. But approach 1 or c suggests that different
conceptual models (e.g. Dublin core versus ISO 19115) would co-exist,
regardless of the underlying data structure (DOM or something else), is
that right? For example, if someone wants to get the title of a document,
would they specify for example "I'm using the TITLE key from the
Dublin core model" or "I'm using the IDENTIFICATION_INFO/CITATION/TITLE
key from the ISO 19115 model"? Or does Tika plan to propose its own
"universal" model?


> (...snip...) However, once we move beyond Map<String, String[]> the
> user is going to have to have some knowledge of the metadata structure
> to extract information, whether that's POJO, DOM or Map<String, Node>.

Right, this is related to my question above. To avoid the need to know
the metadata structure of a specific data format, Tika (in my
understanding) currently maps some metadata to the Dublin core model,
which is used as a "universal" conceptual model. So anyone can ask for
the title without knowing where the title is stored in various data formats.

However, for some more advanced needs, the Dublin core model is not
enough and cannot easily be extended. A new conceptual model is needed.
ISO 19115 is one such conceptual model that could be used in replacement
of Dublin core, but there are also other conceptual models that are even
more complex than ISO 19115. Are there any thoughts about what would be
the compromise between simplicity and completeness in Tika 2?


> On your interest in ISO 19115, to echo Nick, what specifically do you need? What document formats do you see populating this information?

We do not need changes in the Tika model at this time since Apache SIS has
its own metadata engine (but targeting only geospatial data formats like
NetCDF - no Word or PDF parsing - and using ISO 19115 as its "universal
model" instead of Dublin core). But we have seen talks about
geospatial metadata in Tika at the recent ApacheCon, and I was a little bit
worried to see that some proposed solutions (i.e. new properties) were
Tika-specific instead of using international standards (note: I'm not
suggesting to use Apache SIS - only to consider the international
standard behind it).

So I'm not looking for a solution to a technical problem, but I'm trying
to learn more about the strategic direction that Tika wishes to take.
Would Tika consider moving to a richer metadata model than Dublin
core? Would ISO 19115 be considered too geospatial-centric (which I
could understand)? Would Tika support more than one "universal model"
if it wants to preserve Dublin core simplicity with the richness of
other international standards?

About document formats populated with ISO 19115 metadata: standalone ISO
19115 files are provided by various data producers, for example 1) from
NASA, 2) from the Spanish mapping agency or 3) from all French
government agencies:

 1. http://podaac.jpl.nasa.gov/ws/metadata/dataset/?shortName=AVISO_L4_DYN_TOPO_1DEG_1MO&format=iso
 2. http://www.ign.es/csw-inspire/srv/spa/xml_iso19139?id=9584
 3. http://www.geocatalogue.fr/getMetadata?format=XML&id=1785

ISO 19115 information is also embedded in raster data, as in the "GML in
JPEG2000" standard. Equivalent information is embedded in NetCDF files
and translated to the ISO 19115 model by tools like "ncISO" from
NOAA/NGDC. I saw that Tika has an
org.apache.tika.metadata.ClimateForcast interface, but it describes only
the information at the root of NetCDF files without describing the
variables included in those files (which would need a metadata tree
structure).

So this email is for discussion only - not for immediate action.

    Regards,

        Martin



RE: ISO 19115 as a metadata model for Tika?

Posted by "Allison, Timothy B." <ta...@mitre.org>.

>What would be the approach for more richly typed values? Would they be an extension of the current model, or a second model existing in parallel with the first one?
On TIKA-1607, there are two (and a half) proposals:
1) move everything to DOM with helper classes for common elements
2) use POJOs as metadata values
c) ;) keep current setup, perhaps add binary values, use DOM inputstreams for things that already have standards (e.g. Dublin core)  This could be a transitional step to option 1 in Tika 2.0.

If we went with 1 or c), we could embed ISO 19115: we could either embed the info within the DOM or add an ISO DOM stream that would include this information.


>Thanks for the link. TIKA-1607 seems to be about associating arbitrary java.lang.Object values with property keys. But isn't that a little bit opaque? I mean, if a user gets an instance of a class that they do not know, how do they extract information from it?

I agree with this on the one hand.  However, once we move beyond Map<String, String[]> the user is going to have to have some knowledge of the metadata structure to extract information, whether that's POJO, DOM or Map<String, Node>.


>Regarding ISO 19115 support, what seems the main question to me is how to handle a tree structure? 
Right, that's the crux of TIKA-1607.

On your interest in ISO 19115, to echo Nick, what specifically do you need? What document formats do you see populating this information?




Re: ISO 19115 as a metadata model for Tika?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Martin. I’ll contact Gautham who did the original
ISO 19115 parser and see if he has time to take a look.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: ISO 19115 as a metadata model for Tika?

Posted by Martin Desruisseaux <ma...@geomatys.com>.
Hello Chris

Le 03/11/15 19:02, Mattmann, Chris A (3980) a écrit :
> I think having some specific patches showing how this would look
> would help move this away from the abstract and more
> into the concrete area. I encourage you to try it out MartinD,
> and see if there is a good overlap there.

I attached to TIKA-443 a demo extracting some
org.apache.tika.metadata.DublinCore properties from an
org.opengis.metadata.Metadata object. This is not a patch that can be
included in Tika, however, since I do not know how to integrate those
properties in Tika (I would leave this work to volunteers).

This demo tries to give some tips about only one aspect of the
discussion: adding an ISO 19115 parser in Tika. There is another aspect
of the discussion which is not covered by this demo: whether the Tika
metadata model should be extended to support the richness of more
complex models like ISO 19115.

More specifically, if one looks at the demo, one can see that there are
many loops. An "Identification" object can contain many "Citation" objects,
which in turn can contain many "ResponsibleParty" objects, etc. For this
demo I just mapped e.g. the title of the first "Identification" instance to
the DublinCore "title" property, then broke out of the loop. Obviously
information is lost, so the question is whether it is a goal for Tika
to capture that information, or if it is considered too specific.
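
The lossy "first match wins" mapping described above can be sketched as
follows. The Identification and Citation classes here are tiny hypothetical
stand-ins, not the org.opengis.metadata interfaces, and the "title" key is
only a placeholder for the Dublin Core property; the demo attached to
TIKA-443 is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a lossy tree-to-flat mapping; types and keys are illustrative. */
class FirstTitleMapping {
    /** Stand-in for a citation carrying a title. */
    static final class Citation {
        final String title;
        Citation(String title) { this.title = title; }
    }

    /** Stand-in for an identification holding several citations. */
    static final class Identification {
        final List<Citation> citations;
        Identification(List<Citation> citations) { this.citations = citations; }
    }

    /** Copies only the first available title into a flat "title" entry. */
    static Map<String, String> toFlat(List<Identification> identifications) {
        Map<String, String> flat = new LinkedHashMap<>();
        outer:
        for (Identification id : identifications) {
            for (Citation c : id.citations) {
                if (c.title != null) {
                    flat.put("title", c.title);
                    break outer;                 // every later title is dropped
                }
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        List<Identification> ids = List.of(
            new Identification(List.of(new Citation("Data title"), new Citation("Quality report title"))),
            new Identification(List.of(new Citation("Another title"))));
        System.out.println(toFlat(ids));
    }
}
```

With three titles in the input, only one survives in the flat result, which
is exactly the information loss the question is about.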

If Tika chooses to capture such information, then a tree structure will
become necessary. The next question would then be how to do that, whether a
"tree structure" and a "flat structure" should coexist, etc. But we do not
need to answer those questions now (a simple ISO 19115 parser mapping to
the current Dublin Core properties could be done).

    Martin



Re: ISO 19115 as a metadata model for Tika?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
I think having some specific patches showing how this would look
would help move this away from the abstract and more
into the concrete area. I encourage you to try it out MartinD,
and see if there is a good overlap there.

Cheers,
Chris






-----Original Message-----
From: Martin Desruisseaux <ma...@geomatys.com>
Organization: Geomatys
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Tuesday, October 13, 2015 at 1:34 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: ISO 19115 as a metadata model for Tika?

>Le 12/10/15 14:22, Nick Burch a écrit :
>> Currently, it's very easy for a new user of Tika to get the metadata
>> they want out, they can just fetch a simple string value to get
>> started with. You can, when you learn more, start getting more richly
>> typed values out, but the quickstart is simple. Some libraries make it
>> so that you have to learn the full rich metadata structure right from
>> the get-go, which causes problems for new users. Whatever we do to
>> help the power users, we need to not ruin it for the beginners!
>
>What would be the approach for more richly typed values? Would they be
>an extension of the current model, or a second model existing in
>parallel with the first one?
>
>
>> For the discussion on "what should a richer Tika metadata system be
>> based on", I think TIKA-1607 is where that is taking place, plus some
>> related threads on-list.
>
>Thanks for the link. TIKA-1607 seems to be about associating arbitrary
>java.lang.Object values with property keys. But isn't that a little bit
>opaque? I mean, if a user gets an instance of a class that they do not
>know, how do they extract information from it?
>
>
>> In the short term, if there are some key parts of that standard for
>> geospatial metadata that we don't currently handle, and could do
>> easily with the current setup, then we should raise a JIRA + get a
>> sample file + add the support
>
>Regarding ISO 19115 support, the main question seems to me to be how
>to handle a tree structure. The current Tika metadata structure seems to
>be like a Map<String,String[]> (please correct me if I'm wrong), while
>ISO 19115 is more like a Map<String,Node> where each Node can contain
>child nodes, thus forming a tree. The following example in Tika:
>
>    Creator…………………… Jon Smith
>    Publisher……………… A company
>    Title………………………… Anything
>
>would be in the ISO 19115 model (note how the creator and publisher are
>grouped under the same "responsible party" node):
>
>    Citation
>     ├─Title………………………………………………… Anything
>     └─Cited responsible party
>       [1]
>        ├─Role…………………………………………… Author
>        └─Individual
>           └─Name…………………………………… Jon Smith
>       [2]
>        ├─Role…………………………………………… Publisher
>        └─Organisation
>           └─Name…………………………………… A company
>
>The tree structure makes it possible to add other information, like email
>addresses and phone numbers, without confusion about whether the address
>applies to the creator or to the publisher. Of course a flat structure could
>prefix property names (e.g. "creator_address", "publisher_address",
>etc.), but this would result in a lot of keys. For example ISO 19115
>defines 20 standard roles (resourceProvider, custodian, owner, user,
>distributor, originator, pointOfContact, principalInvestigator,
>processor, publisher, author, sponsor, coAuthor, collaborator, editor,
>mediator, rightsHolder, contributor, funder, stakeholder) and each of
>them can be associated with about 30 properties under the "Cited
>responsible party" node (name, positionName, phone, city,
>administrativeArea, postalCode, country, hoursOfService,
>contactInstruction, onlineResource, etc.). Would Tika like to
>handle such an amount of data, and if yes is a flat structure really
>appropriate?
>
>    Martin
>


Re: ISO 19115 as a metadata model for Tika?

Posted by Martin Desruisseaux <ma...@geomatys.com>.
On 12/10/15 14:22, Nick Burch wrote:
> Currently, it's very easy for a new user of Tika to get the metadata
> they want out, they can just fetch a simple string value to get
> started with. You can, when you learn more, start getting more richly
> typed values out, but the quickstart is simple. Some libraries make it
> so that you have to learn the full rich metadata structure right from
> the get-go, which causes problems for new users. Whatever we do to
> help the power users, we need to not ruin it for the beginners!

What would be the approach for more richly typed values? Would they be
an extension of the current model, or a second model existing in
parallel with the first one?


> For the discussion on "what should a richer Tika metadata system be
> based on", I think TIKA-1607 is where that is taking place, plus some
> related threads on-list.

Thanks for the link. TIKA-1607 seems to be about associating arbitrary
java.lang.Object values with property keys. But isn't that a little bit
opaque? I mean, if a user gets an instance of a class they don't know,
how do they extract information from it?


> In the short term, if there are some key parts of that standard for
> geospatial metadata that we don't currently handle, and could do
> easily with the current setup, then we should raise a JIRA + get a
> sample file + add the support

Regarding ISO 19115 support, the main question seems to me to be how
to handle a tree structure. The current Tika metadata structure seems to
be like a Map<String,String[]> (please correct me if I'm wrong), while
ISO 19115 is more like a Map<String,Node> where each Node can contain
child nodes, thus forming a tree. The following example in Tika:

    Creator…………………… Jon Smith
    Publisher……………… A company
    Title………………………… Anything 

would be in the ISO 19115 model (note how the creator and publisher are
grouped under the same "responsible party" node):

    Citation
     ├─Title………………………………………………… Anything
     └─Cited responsible party
       [1]
        ├─Role…………………………………………… Author
        └─Individual
           └─Name…………………………………… Jon Smith
       [2]
        ├─Role…………………………………………… Publisher
        └─Organisation
           └─Name…………………………………… A company
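
As a rough illustration of the two shapes above, here is a minimal Java
sketch. The Node and FlatVersusTree classes and the nameForRole helper are
hypothetical names invented for this example; they are not part of Tika or
Apache SIS, and a real ISO 19115 model would be far richer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tree node for illustration; not a Tika or Apache SIS class. */
final class Node {
    final String name;
    final String value;                       // null for branch nodes
    final List<Node> children = new ArrayList<>();

    Node(String name, String value) {
        this.name = name;
        this.value = value;
    }

    /** Appends a child and returns this node, to allow chained construction. */
    Node add(Node child) {
        children.add(child);
        return this;
    }

    /** Returns the first direct child with the given name, or null if none. */
    Node child(String childName) {
        for (Node c : children) {
            if (c.name.equals(childName)) return c;
        }
        return null;
    }
}

final class FlatVersusTree {
    /** Flat model: each key maps directly to one or more string values. */
    static Map<String, String[]> flat() {
        Map<String, String[]> m = new LinkedHashMap<>();
        m.put("Creator", new String[] {"Jon Smith"});
        m.put("Publisher", new String[] {"A company"});
        m.put("Title", new String[] {"Anything"});
        return m;
    }

    /** Tree model: creator and publisher are both "Cited responsible party" nodes. */
    static Node tree() {
        Node author = new Node("Cited responsible party", null)
                .add(new Node("Role", "Author"))
                .add(new Node("Individual", null).add(new Node("Name", "Jon Smith")));
        Node publisher = new Node("Cited responsible party", null)
                .add(new Node("Role", "Publisher"))
                .add(new Node("Organisation", null).add(new Node("Name", "A company")));
        return new Node("Citation", null)
                .add(new Node("Title", "Anything"))
                .add(author)
                .add(publisher);
    }

    /** Returns the name of the party holding the given role, or null if absent. */
    static String nameForRole(Node citation, String role) {
        for (Node party : citation.children) {
            Node r = party.child("Role");
            if (r == null || !role.equals(r.value)) continue;
            for (Node kind : party.children) {   // Individual or Organisation
                Node name = kind.child("Name");
                if (name != null) return name.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(flat().get("Creator")[0]);            // prints "Jon Smith"
        System.out.println(nameForRole(tree(), "Publisher"));    // prints "A company"
    }
}
```

In the flat map the lookup is a single get(), while in the tree the caller
has to walk from the citation to the party with the wanted role; that is the
price paid for keeping each party's details grouped together.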

The tree structure allows adding other information, like email addresses
and phone numbers, without confusion about whether an address applies
to the creator or to the publisher. Of course a flat structure could
prefix property names (e.g. "creator_address", "publisher_address",
etc.), but this would result in a lot of keys. For example ISO 19115
defines 20 standard roles (resourceProvider, custodian, owner, user,
distributor, originator, pointOfContact, principalInvestigator,
processor, publisher, author, sponsor, coAuthor, collaborator, editor,
mediator, rightsHolder, contributor, funder, stakeholder) and each of
them can be associated with about 30 properties under the "Cited
responsible party" node (name, positionName, phone, city,
administrativeArea, postalCode, country, hoursOfService,
contactInstruction, onlineResource, etc.). Would Tika want to
handle such an amount of data, and if so, is a flat structure really
appropriate?
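
To give a feel for the number of keys a flat model would need, the sketch
below builds prefixed keys of the form role_property from the 20 roles and
only the ten properties named above. The class name FlatKeyCount and the
underscore naming scheme are assumptions for illustration, following the
"creator_address" example; this is not Tika code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of key growth under a flat, prefixed naming scheme. */
final class FlatKeyCount {
    /** The 20 roles listed in the message. */
    static final String[] ROLES = {
        "resourceProvider", "custodian", "owner", "user", "distributor",
        "originator", "pointOfContact", "principalInvestigator", "processor",
        "publisher", "author", "sponsor", "coAuthor", "collaborator", "editor",
        "mediator", "rightsHolder", "contributor", "funder", "stakeholder"
    };

    /** Only the ten properties named in the message, not the full set of about 30. */
    static final String[] PROPERTIES = {
        "name", "positionName", "phone", "city", "administrativeArea",
        "postalCode", "country", "hoursOfService", "contactInstruction",
        "onlineResource"
    };

    /** Builds one flat key per (role, property) pair, e.g. "publisher_phone". */
    static List<String> flatKeys() {
        List<String> keys = new ArrayList<>();
        for (String role : ROLES) {
            for (String property : PROPERTIES) {
                keys.add(role + "_" + property);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(flatKeys().size());   // prints 200
    }
}
```

With the ten properties listed here that is already 200 keys; with the
roughly 30 properties per party it would be about 600.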

    Martin



Re: ISO 19115 as a metadata model for Tika?

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 12 Oct 2015, Martin Desruisseaux wrote:
> At the last ApacheCon in Budapest, we had some discussion about
> geospatial metadata in Tika. Currently Tika has 3 properties (latitude,
> longitude, altitude) in its org.apache.tika.metadata.Geographic
> interface, also reproduced in the TikaCoreProperties interface.
> Geospatial metadata can be more complex, but does Tika wish to support
> more geospatial metadata structures or to keep that model simple?

Both!

Currently, it's very easy for a new user of Tika to get the metadata they
want out: they can just fetch a simple string value to get started with.
You can, when you learn more, start getting more richly typed values out, 
but the quickstart is simple. Some libraries make it so that you have to 
learn the full rich metadata structure right from the get-go, which causes 
problems for new users. Whatever we do to help the power users, we need to 
not ruin it for the beginners!

> If Tika wishes to support geospatial metadata more extensively, would
> Tika consider using the ISO 19115 metadata model? This international
> standard is the official metadata model of the Open Geospatial 
> Consortium (OGC) and is in use in various organisations (some parts of 
> NASA, European Space Agency, Food and Agriculture Organisation, etc.). 
> The ISO 19115 standard is quite big, with about 500 properties.

For the discussion on "what should a richer Tika metadata system be based 
on", I think TIKA-1607 is where that is taking place, plus some related 
threads on-list. If you have ideas/experiences/alternatives, especially 
ones which keep things beginner-friendly, please share them!

In the short term, if there are some key parts of that standard for
geospatial metadata that we don't currently handle, and could do easily
with the current setup, then we should raise a JIRA + get a sample file +
add the support.

Nick