Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2015/03/01 23:21:04 UTC

[jira] [Closed] (TIKA-669) Backup plan for parsing

     [ https://issues.apache.org/jira/browse/TIKA-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-669.
--------------------------------
    Resolution: Duplicate

> Backup plan for parsing
> -----------------------
>
>                 Key: TIKA-669
>                 URL: https://issues.apache.org/jira/browse/TIKA-669
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>
> Currently, once a document type has been detected, we direct the document to the one parser that best matches the detected type. In practice, there are cases where that parser finds that it cannot in fact parse the document, for example when something that looked like XML turns out to have syntax errors. For such cases it would be nice if the CompositeParser could retry parsing the document with a more generic backup parser, such as the plain text parser for malformed XML.
> Implementing this would require some level of buffering and redirection of both parser input and output. Input buffering is easy, but for output buffering we'd probably need to implement new ContentHandler and Metadata layers.
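
The buffering idea described above can be illustrated with a minimal, self-contained sketch. It does not use Tika's actual Parser, CompositeParser, ContentHandler or Metadata classes. SimpleParser, parseWithFallback and extract are hypothetical names introduced only for illustration, and a StringBuilder stands in for the output layer. The input is held in a byte array so each attempt can re-read it from the start, and each attempt writes to its own output buffer so that partial output from a failed parser is thrown away.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class FallbackParsingSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface SimpleParser {
        void parse(InputStream in, StringBuilder out) throws Exception;
    }

    /** Strict parser: streams character data to out and throws on malformed XML. */
    static final SimpleParser XML = (in, out) ->
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });

    /** Generic backup parser: treats the whole input as UTF-8 plain text. */
    static final SimpleParser TEXT = (in, out) ->
        out.append(new String(in.readAllBytes(), StandardCharsets.UTF_8));

    /** Tries each parser in order against the buffered input. */
    static String parseWithFallback(byte[] input, List<SimpleParser> parsers) {
        Exception last = null;
        for (SimpleParser parser : parsers) {
            // Fresh output buffer per attempt, so a failed parser's
            // partial output never reaches the caller.
            StringBuilder attempt = new StringBuilder();
            try {
                // Fresh stream over the same bytes: input buffering.
                parser.parse(new ByteArrayInputStream(input), attempt);
                return attempt.toString();
            } catch (Exception e) {
                last = e; // remember the failure and try the next parser
            }
        }
        throw new IllegalStateException("no parser could handle the input", last);
    }

    /** Convenience wrapper: XML first, plain text as the backup. */
    static String extract(String document) {
        return parseWithFallback(
            document.getBytes(StandardCharsets.UTF_8), List.of(XML, TEXT));
    }

    public static void main(String[] args) {
        System.out.println(extract("<a>hello</a>"));         // well-formed XML
        System.out.println(extract("<a>hello<b>world</a>")); // malformed XML
    }
}
```

In the malformed case the SAX parser has already emitted "hello" and "world" before it reaches the mismatched end tag, which is why the output has to be buffered per attempt and not written straight through. A real implementation would need the same treatment for metadata and would probably cap the amount of input it buffers.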



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)