Posted to dev@tika.apache.org by "Shabanali Faghani (JIRA)" <ji...@apache.org> on 2016/08/05 21:22:20 UTC

[jira] [Updated] (TIKA-2050) HTMLEncodingDetector class fails on some HTML documents

     [ https://issues.apache.org/jira/browse/TIKA-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shabanali Faghani updated TIKA-2050:
------------------------------------
    Attachment: false-negative-responce-from-HTMLEncodingDetector.zip

> HTMLEncodingDetector class fails on some HTML documents
> -------------------------------------------------------
>
>                 Key: TIKA-2050
>                 URL: https://issues.apache.org/jira/browse/TIKA-2050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: false-negative-responce-from-HTMLEncodingDetector.zip
>
>
> When [~tallison@mitre.org] and I were working on [TIKA-2038|https://issues.apache.org/jira/browse/TIKA-2038], I found that the HTMLEncodingDetector class cannot extract the charset from some HTML documents. I’ve attached the HTML documents on which HTMLEncodingDetector fails. It seems that its regex should be corrected to cover these cases.
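
For illustration only, here is a minimal sketch of regex-based charset extraction from a meta tag. It is not Tika's actual code: the class name MetaCharsetSketch, the method sniff, the 8192-byte prefix limit, and the pattern itself are all hypothetical. The attached documents are not reproduced here, so the variations the pattern tolerates (whitespace around "=", single or double quotes, mixed case) are assumed examples of what such a regex might need to cover, not the confirmed failing cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the implementation of Tika's HTMLEncodingDetector.
public class MetaCharsetSketch {

    // Assumed limit on how many leading bytes of the document are inspected.
    private static final int PREFIX_LENGTH = 8192;

    // Looks for charset=VALUE inside a <meta ...> tag. Allows whitespace
    // around '=' and an optional single or double quote before the value.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name as written in the document,
    // or null if no meta charset declaration is found in the prefix.
    public static String sniff(byte[] html) {
        int len = Math.min(html.length, PREFIX_LENGTH);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives
        // regardless of the document's real encoding.
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

A caller would pass the raw bytes of the document and fall back to another detector when sniff returns null. The returned name is not validated here; checking it with java.nio.charset.Charset.isSupported before use would be a sensible next step.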



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)