Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/01/14 00:10:08 UTC

[Tika Wiki] Update of "CompositeParserDiscussion" by NickBurch

The "CompositeParserDiscussion" page has been changed by NickBurch:
https://wiki.apache.org/tika/CompositeParserDiscussion?action=diff&rev1=1&rev2=2

Comment:
Formatting and expand

- =Composite Parser Discussion=
+ = Composite Parser Discussion =
- A given mime type may be supported by several parsers.  Work on TIKA-1445 (adding metadata back into OCR'd text) raised the prominence of this issue.  Currently, the CompositeParser picks the first parser that supports a given mime type.  In discussion on TIKA-1445 other potential use cases were identified.
+ 
+ A given mime type may be supported by several parsers.  Work on [[https://issues.apache.org/jira/browse/TIKA-1445|TIKA-1445]] (adding metadata back into OCR'd text) raised the prominence of this issue.  Currently, the CompositeParser picks the first parser that supports a given mime type.  In discussion on TIKA-1445 other potential use cases were identified.
  
  The purpose of this page is to track a unified vision of the strategies that we'll implement in Tika.
  
  The JIRA issue for this is [[https://issues.apache.org/jira/browse/TIKA-1509|TIKA-1509]].
  
  '''This page is just a start.  Please contribute'''
+ 
- =Strategies=
+ = Strategies =
- ==Classic==
+ == Classic ==
  Sort the parsers by non-tika vs tika and then alphabetically by class name.  Pick the first parser that will handle a given mime type.
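
As a rough illustration of the Classic ordering described above, here is a minimal sketch in plain Java. `ParserInfo`, `ClassicSelection` and `pick` are hypothetical names invented for this example and are not part of Tika's API. The sketch also assumes that non-Tika parsers sort ahead of Tika ones and that a Tika parser can be recognised by the `org.apache.tika.` package prefix.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class ClassicSelection {
    /** Hypothetical stand-in for a parser: its class name and the mime types it claims. */
    static final class ParserInfo {
        private final String className;
        private final Set<String> mimeTypes;

        ParserInfo(String className, Set<String> mimeTypes) {
            this.className = className;
            this.mimeTypes = mimeTypes;
        }

        String className() { return className; }

        // Assumption: a parser counts as "Tika" if it lives under org.apache.tika.
        boolean isTika() { return className.startsWith("org.apache.tika."); }

        boolean supports(String mimeType) { return mimeTypes.contains(mimeType); }
    }

    // Boolean.FALSE sorts before Boolean.TRUE, so non-Tika parsers come first
    // (an assumption of this sketch); ties are broken alphabetically by class name.
    static final Comparator<ParserInfo> CLASSIC_ORDER =
            Comparator.comparing(ParserInfo::isTika)
                      .thenComparing(ParserInfo::className);

    /** Returns the first parser, in classic order, that supports the mime type. */
    static Optional<ParserInfo> pick(List<ParserInfo> parsers, String mimeType) {
        List<ParserInfo> sorted = new ArrayList<>(parsers);
        sorted.sort(CLASSIC_ORDER);
        return sorted.stream().filter(p -> p.supports(mimeType)).findFirst();
    }
}
```

With this ordering, a third-party parser registered for a mime type would be chosen over a bundled one that supports the same type.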
  
- ==Supplementary/Additive==
+ == Supplementary/Additive ==
  Concatenate the results (metadata and content) from several parsers.
  
  We need a better name for this!
  
- ==Back-off==
+ == Back-off ==
  Try one parser and if the output doesn't meet some criterion, apply another.  One use case for this might be: if a file is identified as XML, try the XMLParser and if that throws an exception, try the HTMLParser. 
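
The exception-driven variant of Back-off might look roughly like the sketch below. `SimpleParser` and `parseWithBackOff` are hypothetical and deliberately simplified to string input and output; Tika's real `Parser` interface works on an input stream with a content handler, metadata and a parse context. Only the "parser throws" criterion is shown, not a check on output quality.

```java
import java.util.List;

final class BackOff {
    /** Hypothetical, simplified parser contract used only for this sketch. */
    interface SimpleParser {
        String parse(String input) throws Exception;
    }

    /**
     * Tries each parser in order and returns the first successful result.
     * If every parser fails, throws an IllegalStateException that carries
     * the individual failures as suppressed exceptions.
     */
    static String parseWithBackOff(String input, List<SimpleParser> parsers) {
        IllegalStateException failure = new IllegalStateException("all parsers failed");
        for (SimpleParser parser : parsers) {
            try {
                return parser.parse(input);
            } catch (Exception e) {
                // Remember why this parser failed, then back off to the next one.
                failure.addSuppressed(e);
            }
        }
        throw failure;
    }
}
```

A stream-based version would also need the input to be re-readable (buffered or spooled to a file), since the first parser may consume part of the stream before it fails.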
  
- ==Pick the Best Output==
+ == Pick the Best Output ==
  One use case for this: the charset detector identifies two equally likely charsets.  Apply both and use the wished-for junk detector (TIKA-1443) to determine which output is less likely to be junk.
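
To make the charset use case concrete, here is a small sketch that decodes the same bytes with each candidate charset and keeps the least junky result. The junk detector does not exist yet, so `junkScore` is a purely hypothetical placeholder that counts U+FFFD replacement characters; `PickBest` and `decodeBest` are likewise invented for this example.

```java
import java.nio.charset.Charset;
import java.util.List;

final class PickBest {
    /**
     * Hypothetical placeholder for a junk detector: counts U+FFFD replacement
     * characters, which String decoding emits for malformed or unmappable bytes.
     * Lower scores are better.
     */
    static int junkScore(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    /** Decodes with every candidate charset and returns the lowest-scoring text. */
    static String decodeBest(byte[] bytes, List<Charset> candidates) {
        String best = null;
        int bestScore = Integer.MAX_VALUE;
        for (Charset charset : candidates) {
            String decoded = new String(bytes, charset);
            int score = junkScore(decoded);
            if (score < bestScore) { // on a tie, the earlier candidate is kept
                best = decoded;
                bestScore = score;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no candidate charsets supplied");
        }
        return best;
    }
}
```

A real detector would need a much richer signal than replacement characters, since single-byte charsets such as ISO-8859-1 decode every byte without error.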
  
+ == Fastest ==
+ If two parsers support a type, use the faster one, even if that may mean lower-quality output (e.g. avoiding OCR).
+ 
+ = Allowing the User to select a strategy =
+ The right strategy for one user may not be the right one for another. The right strategy for one file may not be the right one for another. We therefore need to allow users to pick their strategy, both on an overall basis and on a per-file basis.
+ 
+ == From TikaConfig ==
+ ''TODO''
+ 
+ == With a Tika Configuration file ==
+ ''TODO''
+ 
+ == In Code ==
+ ''TODO''
+