Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2013/03/07 04:07:24 UTC

[Tika Wiki] Update of "MetadataDiscussion" by domtheo

The "MetadataDiscussion" page has been changed by domtheo:
http://wiki.apache.org/tika/MetadataDiscussion?action=diff&rev1=2&rev2=3

  This page has been created to host a discussion on how Tika returns metadata for different kinds of documents. The goal is to make sure that Tika users have a chance to get to all of the metadata created and/or extracted by Tika.
  
  == Original Problem ==
- The original inspiration for this page was a Tika user who wanted to get access to the metadata for every document in an archive (e.g. zip, tar.gz, etc.). A way to get recursive metadata is described in the RecursiveMetadata article.
+ The original inspiration for this page was a Tika user who wanted to get access to the metadata for every document in an archive (e.g. zip, tar.gz, etc.). A way to get [[http://www.propertykita.com/rumah.html|Rumah Dijual]] recursive metadata is described in the RecursiveMetadata article [[http://vamostech.com/gps-tracking|GPS Tracker]] and [[http://www.pedatimotor.com|Aksesoris Sparepart Motor]].
  
  == Goals for this Page ==
  The goals for this page are bigger than the original problem. This page should hold a discussion about how to better meet different metadata needs for the different kinds of documents supported by Tika, and for the different kinds of users supported by Tika.
@@ -69, +69 @@

  When I first started using Tika, I had the naive dream that I could point the AutoDetectParser at anything and it would automatically find the document boundaries that matter to me and make everything I consider a single document look like the following:
  
  {{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns
    <head>
      <title>...</title>
      <thismeta>...</thismeta>
@@ -91, +91 @@

  == A Slightly Less Naive Non-Solution ==
  This solution is like the first naive solution, except that it uses legal XHTML:
  {{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns=
    <head>
      <title>...</title>
      <meta name="description" content="Example XHTML" />
@@ -112, +112 @@

  == Div Sections: No Place for Metadata ==
  The first two non-solutions ignored that decisions have already been made about how Tika will represent structured documents and simple containers in XHTML. Tika represents a simple container document something like the following:
  {{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns
    <head>
      <title>...</title>
    </head>
@@ -136, +136 @@

  
  The problem is that there is no place to put the metadata that is legal XHTML. The {{{<meta>}}} tags can only appear in the {{{<head>}}} section. Even if we wanted to put all metadata in the {{{<head>}}} section, doing so would mean that Tika could not stream the XHTML events, and would instead have to parse entire containers in two passes: once to gather the metadata, and a second time to output all of the text.
  
- If XHTML had a way to specify arbitrary name-value pairs somewhere in the {{{<div>}}} section, that could be used as a place to associate metadata with a {{{<div>}}} section. As far as I can tell from the specification [http://www.w3schools.com/tags/tag_div.asp] there isn't a place for arbitrary name-value pairs.
+ If XHTML had a way to specify arbitrary name-value pairs somewhere in the {{{<div>}}} section, that could be used as a place to associate metadata with a {{{<div>}}} section. As far as I can tell from the specification there isn't a place for arbitrary name-value pairs.
  
  = Potential Solutions That Could Work =
  Hopefully we can find some solutions that actually work, and work for many kinds of users. It doesn't look like there is a way to represent metadata for nested sections or nested documents in XHTML, but there may be other ways to make nested metadata available to some users.