Posted to dev@tika.apache.org by "Bertrand Delacretaz (JIRA)" <ji...@apache.org> on 2007/10/13 10:26:50 UTC

[jira] Commented: (TIKA-58) Replace jtidy html parser with nekohtml based parser

    [ https://issues.apache.org/jira/browse/TIKA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534482 ] 

Bertrand Delacretaz commented on TIKA-58:
-----------------------------------------

Good idea. In Cocoon 2.1.x we have both JTidy and NekoHTML, and I find myself using Neko all the time.

> Replace jtidy html parser with nekohtml based parser
> ----------------------------------------------------
>
>                 Key: TIKA-58
>                 URL: https://issues.apache.org/jira/browse/TIKA-58
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Sami Siren
>            Assignee: Sami Siren
>            Priority: Minor
>         Attachments: TIKA-58.diff
>
>
> The following patch replaces the JTidy-based HTML parser with a NekoHTML-based SAX parser. It provides only the same functionality as the JTidy-based one (it extracts the title into metadata) and passes all other SAX events through. The speed improvement is around 100%.
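
For readers unfamiliar with the pattern described in the issue, here is a minimal sketch of a SAX handler that records the title into metadata and forwards every event unchanged. It is not taken from TIKA-58.diff: the class name TitleCapturingFilter, the extract helper, and the plain Map used as metadata are all hypothetical, and the JDK's built-in XML parser stands in for NekoHTML as the event source so that the example runs without extra libraries (which also means the input must be well-formed).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical illustration, not the code from the patch.
// XMLFilterImpl forwards each event to the handler set via setContentHandler,
// so calling super.* in every override keeps the stream intact for downstream.
class TitleCapturingFilter extends XMLFilterImpl {
    private final Map<String, String> metadata;
    private final StringBuilder titleText = new StringBuilder();
    private boolean inTitle = false;

    TitleCapturingFilter(ContentHandler downstream, Map<String, String> metadata) {
        this.metadata = metadata;
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = true;
        }
        super.startElement(uri, localName, qName, atts); // pass through
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (inTitle) {
            titleText.append(ch, start, length); // text may arrive in several chunks
        }
        super.characters(ch, start, length); // pass through
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = false;
            metadata.put("title", titleText.toString().trim());
        }
        super.endElement(uri, localName, qName); // pass through
    }

    // Assumption: the JDK parser replaces NekoHTML here, so input must be well-formed.
    static Map<String, String> extract(String xhtml, ContentHandler downstream) {
        Map<String, String> metadata = new HashMap<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new TitleCapturingFilter(downstream, metadata));
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return metadata;
    }
}
```

Calling TitleCapturingFilter.extract(markup, handler) returns the metadata map while handler still sees the complete event stream. Swapping in a tolerant HTML parser as the event source would leave the filter itself unchanged, which is presumably what makes this kind of replacement a small patch.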
