Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2019/01/08 21:04:04 UTC
[Tika Wiki] Update of "TikaEvalAndStructuralComponents" by TimothyAllison

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "TikaEvalAndStructuralComponents" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEvalAndStructuralComponents

New page:
'''NOTE: THIS IS A PAGE IN PROGRESS AND SHOULD BE VIEWED AS A VERY ROUGH DRAFT''' 

= Evaluating Structural Components in Extracted Content with the tika-eval Module =

'''NOTE:''' This page assumes basic knowledge of the tika-eval workflow.  Please see TikaEval and make sure that you understand how the tika-eval module works on text before considering structure.

File formats often contain structural or stylistic elements, and Apache Tika attempts to normalize and represent some of these features in its XHTML output.  As of Tika 1.20, users can get counts of common XHTML tags (in Profile mode) and/or comparison counts of common XHTML tags (in Compare mode).  Users can also count "tag exceptions" -- cases where the structure tags violate XML/XHTML requirements, e.g. `<b><i></b></i>`.
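
To make the idea of tag counts and "tag exceptions" concrete, here is a minimal Python sketch. It is not tika-eval's implementation: the `TagAudit` class and `audit` function are hypothetical names, and the rule used here (an end tag that does not close the innermost open element counts as one exception) is an assumption chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagAudit(HTMLParser):
    """Counts start tags and flags end tags that do not close the innermost open element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()   # tag name -> number of start tags seen
        self.exceptions = 0       # number of mismatched end tags (assumed definition)
        self._open = []           # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()
            return
        # The end tag does not match the innermost open element.
        self.exceptions += 1
        if tag in self._open:
            # Unwind the stack up to and including the matching open tag.
            while self._open.pop() != tag:
                pass


def audit(xhtml):
    """Returns (tag counts, exception count) for an XHTML string."""
    parser = TagAudit()
    parser.feed(xhtml)
    parser.close()
    return parser.counts, parser.exceptions
```

Calling `audit("<b><i></b></i>")` counts one `b` and one `i` and, under the rule assumed above, flags both end tags; a well-formed fragment such as `<p><b>x</b></p>` yields no exceptions.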

= Known Limitations =
 * Counting structure tags offers only rudimentary insight into the structure of a single extract, or into the differences between two extracts of the same source file.  One might want to apply a more advanced tree-based similarity/distance metric between two extracts -- our JIRA is open and committers are standing by.
 * If one tool's extracts have more `<p>` elements than another tool's, that doesn't necessarily mean that one extract is better than the other.
 For example, one tool (Tool A) might add `<p>` elements for every new line in a PDF:
   `<p>The quick brown fox</p>`
   `<p>jumped over the lazy dog</p>`

  Another tool (Tool B) might apply heuristics to reconstruct logical paragraphs, such as
   `<p>The quick brown fox jumped over the lazy dog. </p>`

Tool A would have more `<p>` tags, but Tool B is probably capturing better information about the structure of the document.
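
The Tool A / Tool B example can be reproduced with a small Python sketch. The helper names `tag_counts` and `plain_text` are hypothetical, and wrapping each fragment in a `<body>` element is only a convenience for parsing; none of this reflects how tika-eval itself compares extracts.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xhtml_fragment):
    """Counts the elements in a well-formed XHTML fragment."""
    root = ET.fromstring(f"<body>{xhtml_fragment}</body>")
    # Skip the synthetic <body> wrapper added above.
    return Counter(el.tag for el in root.iter() if el is not root)


def plain_text(xhtml_fragment):
    """Returns the fragment's text with runs of whitespace collapsed."""
    root = ET.fromstring(f"<body>{xhtml_fragment}</body>")
    return " ".join(" ".join(root.itertext()).split())


# The two extracts from the example above.
tool_a = "<p>The quick brown fox</p>\n<p>jumped over the lazy dog</p>"
tool_b = "<p>The quick brown fox jumped over the lazy dog. </p>"

print(tag_counts(tool_a), tag_counts(tool_b))
print(plain_text(tool_a))
print(plain_text(tool_b))
```

The two extracts carry essentially the same words, yet the `<p>` counts differ, which is why a raw count difference should be read as a prompt for manual review and not as a quality score.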
 
= Intended Uses/Scope = 

= How to Count Structural Components =

If you are using Tika to generate .json files, follow the directions on TikaEval for how to create a directory of extracts, but don't include the `-t` option: `java -jar tika-app.X.Y.jar -J -i input_dir -o extracts`.  This has the effect of storing the content that is extracted as XHTML, and it sets a metadata value of `ToXMLContentHandler` for the key `X-TIKA:content_handler`.  When tika-eval finds that value set in the metadata, it parses the XHTML with a SAXParser to count the structure tags and extract the text.
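
The check-then-parse step described above can be sketched in Python with the standard library's SAX parser. This is an illustration under stated assumptions, not tika-eval's code: it assumes the extract is a JSON list of metadata objects whose first entry holds the XHTML under the key `X-TIKA:content`, and `TagAndTextHandler` and `profile_extract` are hypothetical names.

```python
import json
import xml.sax
from collections import Counter


class TagAndTextHandler(xml.sax.ContentHandler):
    """Collects start-tag counts and character data during a SAX parse."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []

    def startElement(self, name, attrs):
        self.tag_counts[name] += 1

    def characters(self, content):
        self.text_parts.append(content)


def profile_extract(json_text):
    """Returns (tag counts, text) if the extract was stored as XHTML, else None."""
    metadata_list = json.loads(json_text)
    container = metadata_list[0]  # assumption: first entry is the container document
    if container.get("X-TIKA:content_handler") != "ToXMLContentHandler":
        return None  # content is not XHTML, so there are no structure tags to count
    handler = TagAndTextHandler()
    # Assumption: the extracted XHTML lives under "X-TIKA:content".
    xml.sax.parseString(container["X-TIKA:content"].encode("utf-8"), handler)
    return handler.tag_counts, "".join(handler.text_parts)
```

Given an extract whose content is `<html xmlns="http://www.w3.org/1999/xhtml"><body><p>The quick brown fox</p><p>jumped over the lazy dog</p></body></html>`, `profile_extract` reports two `p` elements along with the concatenated text. Note that a strict SAX parse raises an error on malformed markup, so counting tag exceptions would need a more lenient approach.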

= Handling