Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2013/12/28 02:14:51 UTC

[jira] [Resolved] (TIKA-1193) Allow access to HtmlParser's HtmlSchema

     [ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-1193.
---------------------------------

    Resolution: Fixed

The patch is perfect, thanks! Committed in revision 1553774.

> Allow access to HtmlParser's HtmlSchema
> ---------------------------------------
>
>                 Key: TIKA-1193
>                 URL: https://issues.apache.org/jira/browse/TIKA-1193
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Jukka Zitting
>             Fix For: 1.5
>
>         Attachments: TIKA-1193-trunk.patch, TIKA-1193-trunk.patch
>
>
> TagSoup's HTMLSchema is not well suited to HTML5, nor can it correctly handle some unusual quirks, e.g. tables inside anchors. By allowing access to the schema, applications can modify it on the fly to suit their needs.
> This would also mean that we no longer have to rely on TIKA-985 being committed; we can change the schema from our own applications.
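
The idea in the description above (a parser that uses a built-in schema unless the application supplies its own) can be sketched roughly as follows. This is only an illustration in plain Java: SketchSchema, SketchContext and SketchHtmlParser are hypothetical stand-ins invented for this example, not Tika or TagSoup classes, and the message does not show how the committed patch actually exposes the schema.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a schema: maps an element to the parents it may appear in. */
final class SketchSchema {
    private final Map<String, Set<String>> allowedParents = new HashMap<>();

    /** Declares that `child` may appear directly inside `parent`. */
    void allow(String child, String parent) {
        allowedParents.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
    }

    boolean permits(String child, String parent) {
        return allowedParents
                .getOrDefault(child, Collections.<String>emptySet())
                .contains(parent);
    }
}

/** Hypothetical stand-in for a per-parse context keyed by class. */
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the value registered for `key`, or `defaultValue` if none was set. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

/** Hypothetical parser front end: uses the caller's schema if present, else its default. */
final class SketchHtmlParser {
    private static final SketchSchema DEFAULT_SCHEMA = defaultSchema();

    private static SketchSchema defaultSchema() {
        SketchSchema schema = new SketchSchema();
        schema.allow("a", "body");
        schema.allow("table", "body"); // the default does not allow a table inside an anchor
        return schema;
    }

    static boolean accepts(String child, String parent, SketchContext context) {
        SketchSchema schema = context.get(SketchSchema.class, DEFAULT_SCHEMA);
        return schema.permits(child, parent);
    }
}

final class SchemaOverrideSketch {
    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        System.out.println(SketchHtmlParser.accepts("table", "a", context)); // default schema

        // The application adjusts a schema of its own and passes it in for this parse.
        SketchSchema custom = new SketchSchema();
        custom.allow("table", "a");
        context.set(SketchSchema.class, custom);
        System.out.println(SketchHtmlParser.accepts("table", "a", context)); // custom schema
    }
}
```

In this sketch the override is scoped to one context, so an application can relax the rules for a single parse without touching the shared default.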


