Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/01/07 19:59:00 UTC
[jira] [Resolved] (TIKA-2810) Back off to tagsoup when xml parser fails on Tika xhtml in tika-eval
[ https://issues.apache.org/jira/browse/TIKA-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2810.
-------------------------------
Resolution: Fixed
Assignee: Tim Allison
Fix Version/s: 1.21, 2.0.0
> Back off to tagsoup when xml parser fails on Tika xhtml in tika-eval
> --------------------------------------------------------------------
>
> Key: TIKA-2810
> URL: https://issues.apache.org/jira/browse/TIKA-2810
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.0.0, 1.21
>
>
> In TIKA-2791, we added extraction of structure tags. If Tika's xhtml failed to parse as XML, we initially backed off to treating the full xhtml as a string of text that happened to include markup.
> It would be better to back off to the html parser (tagsoup) so that content comparisons still work accurately even when the tags are malformed, for example mis-nested as in <b><i></b></i>.
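
The back-off described above can be illustrated with a minimal sketch. This is not the tika-eval code, which is Java and uses tagsoup. It is a Python illustration of the same idea using only the standard library, and the names extract_text_and_tags and _LenientCollector are hypothetical. It tries a strict XML parse first and, on failure, re-parses the same input with a lenient HTML parser so that text and tag counts are still recovered.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _LenientCollector(HTMLParser):
    """Hypothetical helper: collects text and start-tag counts
    without requiring well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.tag_counts = {}

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_text_and_tags(xhtml):
    """Return (text, tag_counts, mode), where mode is 'strict' or 'lenient'."""
    try:
        # First attempt: strict XML parse of the xhtml string.
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        # Back off: lenient HTML parse, so mis-nested tags do not
        # force the markup itself to be treated as text content.
        collector = _LenientCollector()
        collector.feed(xhtml)
        collector.close()
        return "".join(collector.text_parts), collector.tag_counts, "lenient"

    tag_counts = {}
    for element in root.iter():
        # Drop any "{namespace}" prefix from the element name.
        name = element.tag.rsplit("}", 1)[-1]
        tag_counts[name] = tag_counts.get(name, 0) + 1
    return "".join(root.itertext()), tag_counts, "strict"
```

With this sketch, a mis-nested input such as <p>hello <b><i>world</b></i></p> fails the strict parse but still yields its text and tag counts through the lenient path, rather than having the raw markup counted as content.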
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)