Posted to dev@tika.apache.org by Leo Sauermann <le...@gnowsis.com> on 2010/11/14 10:13:35 UTC

RecursiveMetadata and MetadataDiscussion - some long-term input

Hi Tika,
(cc Aperture, just fyi)

I stumbled upon
http://wiki.apache.org/tika/MetadataDiscussion
and
http://wiki.apache.org/tika/RecursiveMetadata


The problems don't stop there:
if you think it through, you end up with zip files containing zip files
containing .pst and email files containing attached Word documents
containing embedded Excel sheets.

In the SourceForge project "Aperture" (it's similar to Tika) the solution
was to use the W3C standard RDF, which allows stacking information inside
other information to any depth. This was also used in the NEPOMUK-KDE Linux
implementation, but there in C++ and with a slightly different angle to it.
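
To make the stacking idea concrete, here is a minimal Python sketch (not Aperture or Tika code) that walks a zip file nested inside another zip file and records each contained item as a subject/predicate/object statement. The `isPartOf` predicate name, the `describe` and `make_zip` helpers and the path-style identifiers are illustrative assumptions; a real RDF setup would use URIs and an agreed vocabulary instead.

```python
import io
import zipfile

# Hypothetical predicate name, used only for this illustration.
IS_PART_OF = "isPartOf"


def describe(data, name, triples=None):
    """Collect (subject, predicate, object) statements for nested zip content."""
    if triples is None:
        triples = []
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return triples  # a leaf document: nothing is stacked inside it
    with zipfile.ZipFile(buffer) as archive:
        for entry in archive.namelist():
            child = name + "/" + entry
            triples.append((child, IS_PART_OF, name))
            # Recurse: the entry may itself be a container.
            describe(archive.read(entry), child, triples)
    return triples


def make_zip(files):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for entry, content in files.items():
            archive.writestr(entry, content)
    return buffer.getvalue()


inner = make_zip({"report.txt": b"hello"})
outer = make_zip({"inner.zip": inner, "notes.txt": b"plain text"})
for statement in describe(outer, "outer.zip"):
    print(statement)
```

Because every statement is flat and only refers to its parent by name, the nesting depth never changes the shape of the data; that is the property the RDF approach relies on.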

It may be useful to check out their documentation and the status of their
discussion:

the data model:
http://www.semanticdesktop.org/ontologies/

this is the specific model of stacking things into each other:
http://www.semanticdesktop.org/ontologies/2007/01/19/nie/

the stacking/recursive problem was solved using "subcrawlers":
http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers

general structure of things coming together:
http://sourceforge.net/apps/trac/aperture/wiki/GeneralStructure


From my experience (I am co-author and was initiator of most of the
above) there is only a limited short-term benefit in adopting this
thinking, but a bigger long-term benefit, as being compatible with
RDF/W3C will in the long run make Tika compatible with what happens in
HTML5 and other standardization efforts.
Looking at this stuff could help as a guideline for decisions in Tika.


So - Could anyone please think about it for a minute and add these links
and some ideas how to deal with it to
http://wiki.apache.org/tika/MetadataDiscussion
and
http://wiki.apache.org/tika/RecursiveMetadata
?


best
Leo Sauermann, Dr.
CEO and Founder

p.s.
There used to be a much closer tie between Tika and Aperture in 2007,
but as Aperture development is more or less finished (it's in production now
at some places and fixes are only done when needed) it seems communication
between them has dropped off a bit. Does anyone know why?


mail: leo.sauermann@gnowsis.com
mobile: +43 6991 gnowsis
http://www.gnowsis.com

helping people remember,

so join our newsletter
http://www.gnowsis.com/about/content/newsletter
____________________________________________________

RE: RecursiveMetadata and MetadataDiscussion - some long-term input - if you need RDF call xesam or aperture

Posted by Jukka Zitting <jz...@adobe.com>.

Hi,

From: Leo Sauermann [mailto:leo.sauermann@gnowsis.com]
> RDF is the only cross-format standard out there, there are standardized
> representations in XML, JSON, HTML, and databases. That would make it a
> good fit for frameworks, such as Tika.

Agreed. The idea of using XMP (a metadata model based on RDF) has come up every now and then on dev@tika (see the archives), and I think that's what we should be working towards. Note however that the scope of Tika has at least so far been intentionally smaller than that of Aperture.

For example, we explicitly don't try to preserve the full structural or semantic details of parsed documents. Thus the points about mapping VCARD or ICAL data to RDF are somewhat irrelevant for Tika, as we'd just map such data to semi-structured XHTML whose main purpose is to support full text indexing or other unstructured text processing applications. In other words, Tika is lossy by design.

Another point, more related to recursive metadata, is that we make no attempt at defining a representation for compound documents. The rationale for this is that such representations are necessarily application- or domain-specific. Tika avoids making those design choices by having the Parser API only recognize singular documents, but allowing programmatic access to subdocuments through the EmbeddedDocumentExtractor (or the more general ParseContext) mechanism. A client application can use these tools to construct any kind of hierarchical metadata structures.
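
The division of labour described above can be sketched in Python. This is an analogy, not the Tika API: `parse`, `TreeBuilder` and `make_zip` are hypothetical names, and zip archives stand in for compound documents. The parsing function only ever handles one document and hands each embedded document to a handler supplied by the caller; the caller decides what hierarchy, if any, to build from those calls.

```python
import io
import zipfile


def parse(data, name, on_embedded):
    """Parse one document; report embedded documents to the caller's handler.

    Returns flat metadata for this single document only.
    """
    metadata = {"name": name, "size": len(data)}
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            for entry in archive.namelist():
                # The parser does not decide how to represent the child.
                on_embedded(archive.read(entry), entry)
    return metadata


class TreeBuilder:
    """One possible client: nests each child's metadata under its parent."""

    def build(self, data, name):
        children = []

        def handler(child_data, child_name):
            children.append(self.build(child_data, child_name))

        node = parse(data, name, handler)
        node["children"] = children
        return node


def make_zip(files):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for entry, content in files.items():
            archive.writestr(entry, content)
    return buffer.getvalue()


inner = make_zip({"report.txt": b"hello"})
outer = make_zip({"inner.zip": inner, "notes.txt": b"plain text"})

# Client 1: a full hierarchy.
tree = TreeBuilder().build(outer, "outer.zip")

# Client 2: only the names of the direct children, no recursion at all.
flat = []
parse(outer, "outer.zip", lambda child_data, child_name: flat.append(child_name))
```

Both clients use the same `parse` function; only the handler differs, which is why the representation of compound documents can stay an application-level decision.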

To summarize: yes, I think RDF is a good idea for Tika, but only in terms of extending our metadata model to XMP. I don't see how RDF would be more useful than XHTML in representing the full text content of a document; at least as long as we're not looking at radically extending the scope of Tika.

BR,

Jukka Zitting

Re: RecursiveMetadata and MetadataDiscussion - some long-term input - if you need RDF call xesam or aperture

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Thanks Leo, appreciate the discussion.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: RecursiveMetadata and MetadataDiscussion - some long-term input - if you need RDF call xesam or aperture

Posted by Leo Sauermann <le...@gnowsis.com>.

Hi,

ok, good feedback, thanks for taking the time to answer.

I feel an urge to take an "RDF vs JSON/..." discussion off-list, as I
have been seeing this discussion since 1999. btw, RSS originally meant
"RDF Site Summary"... so RDF>RSS

but it's an important discussion - so - more input

RDF is the only cross-format standard out there, there are standardized
representations in XML, JSON, HTML, and databases. That would make it a
good fit for frameworks, such as Tika.

Of course, the 120 minutes it takes to learn RDF are longer than the 10
minutes it takes to learn JSON. My experience was that for data
integration projects, the extra 110 minutes pay off.

I guess that's the reason why Facebook and Google dig RDF now... it is
the only proper way to let data flow from databases out to the web and
back into other databases.
(that's what Google now supports with price databases and the RDF-based
"GoodRelations" e-commerce SEO format)

If the consensus within Tika is "RDF is too complex for us, we don't
need it", that's fine.
It took Sebastian Trüg about a year of discussion on the KDE
mailing lists to explain why RDF is better suited for data integration in
document indexing, until the KDE people were convinced to switch the
system search engine to RDF.

some points:
Inference - please ignore this, you don't need it.

Field definition - you will soon have a problem in Tika when you want to
crawl VCARD and ICAL files and extract the full richness of ALL data
embedded in those formats. Here RDF helped Aperture a lot.
So for the whole area of types and their fields, subfields and
hierarchical fields, RDF could help.

XML - whatever, RDF is serialization-agnostic. It works best in internal
APIs I guess, where data should flow from one component to another
without being reformatted.

Let's see it the other way round:

if you need info on why RDF is better than anything else (ho ho ho), ask on
the Aperture-dev mailing list; people there are eager to help, I guess.

Grant Ingersoll used to hang out over at the Aperture-dev mailing list.

if this is ok, I would cease this thread now from my side and say: if
the question pops up, get in touch with Aperture or KDE people.

if there is a need to get inspired, aperture people are there to help.

I would guess the same goes for the writers of the KDE Linux desktop
indexing. They also use RDF as their format, and there is an overarching
standardization effort (OSCAF.org) amongst all of us... that could also
be a place to discuss; around a million EUR was spent just discussing
those RDF data formats (ontologies) that are now running ;-)

I cc Sebastian Trüg in this mail, he is the main developer and
boss-of-ontologies at KDE. I guess that Tika people are welcome to check
out what happens on the KDE/GNOME side on the "Xesam" mailing list.
There is (not enough) documentation here on whom to ask in case of questions:
http://sourceforge.net/apps/trac/oscaf/wiki/Communication
http://sourceforge.net/apps/trac/oscaf/wiki/Ontologies

best
Leo

Re: RecursiveMetadata and MetadataDiscussion - some long-term input

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Thanks Leo, we'll take a look.

FYI, one of the goals of Tika is to be extremely light-weight and to provide a canonical metadata representation, independent of any particular "view" of metadata. In my mind RDF is just as much a view as, e.g., RSS, FGDC, JSON, or any one of the myriad views out there. Sure, it comes with inference and all of the other promised goodies, but in my experience I've seen little real use of those in data management systems. I've seen more use of RDF as a nice, compact XML format to represent metadata and allow interchange than anything else. I'd be opposed to making it the standard in Tika though because, as I said, to me it's just a view.

Regardless, thanks for reaching out. I have a number of downstream ideas for helping Tika become more useful for showing different metadata "views", as I call them, and plan on starting to implement/contribute some of them in the coming year, as soon as this book [1] starts to wrap up :) I think a number of other Tika community members have been doing a fantastic job at keeping the metadata capabilities in Tika simple, light-weight, and feature-rich, and I expect it to continue down that path.

Cheers,
Chris

[1] http://www.manning.com/mattmann/
