Posted to dev@tika.apache.org by "Shabanali Faghani (JIRA)" <ji...@apache.org> on 2018/12/02 22:33:00 UTC

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698268#comment-16698268 ] 

Shabanali Faghani edited comment on TIKA-2038 at 12/2/18 10:32 PM:
-------------------------------------------------------------------

[~HansBrende] thank you for your interest in IUST and for your great analysis.

With regard to your work here and on TIKA-2771, and also [CommonCrawl3|https://wiki.apache.org/tika/CommonCrawl3] and TIKA-2750 by [~tallison@apache.org], it looks like it's time to resume this thread.

The algorithm of jchardet is just as you've described. To make IUST more efficient and standalone, with no dependencies, I also made a small attempt to separate out jchardet's UTF-8 detector after my last comment here. If I remember correctly, jchardet keeps a small list of scores, one per detector, and at the end of the detection process it scans this list to find the best match. So I thought it would be impossible to split out its UTF-8 detector: sometimes jchardet detects the charset of a page as something other than UTF-8 because that charset scores higher in the list, even in the presence of UTF-8. If that is true, then in the absence of the other detectors jchardet will detect those cases as UTF-8, which means its false positives for UTF-8 will increase (and its true negatives will decrease)... I don't know, maybe dramatically!
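
To make the failure mode concrete: a standalone UTF-8 detector is essentially a validity check over byte sequences, so any document whose bytes happen to form valid UTF-8 will pass it, whatever its real charset. Below is a minimal sketch of such a check, illustrative only and not jchardet's actual code; it also skips a few overlong/surrogate corner cases:

{code:java}
// Minimal sketch of a standalone UTF-8 validity check, for illustration only.
// jchardet's real detector additionally tracks per-charset scores, which is
// exactly the cross-detector context that gets lost when it is split out.
public final class Utf8Check {

    /** Returns true if the given bytes form a structurally valid UTF-8 sequence. */
    public static boolean looksLikeUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            int trailing;
            if (b < 0x80) {                      // ASCII
                trailing = 0;
            } else if (b >= 0xC2 && b <= 0xDF) { // 2-byte sequence
                trailing = 1;
            } else if (b >= 0xE0 && b <= 0xEF) { // 3-byte sequence
                trailing = 2;
            } else if (b >= 0xF0 && b <= 0xF4) { // 4-byte sequence
                trailing = 3;
            } else {
                return false;                    // invalid lead byte
            }
            for (int j = 1; j <= trailing; j++) {
                if (i + j >= bytes.length || (bytes[i + j] & 0xC0) != 0x80) {
                    return false;                // truncated or bad continuation
                }
            }
            i += trailing + 1;
        }
        return true;
    }
}
{code}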

I'll test the false-positive and true-negative rates of f8 and compare them with jchardet's. I hope I've been wrong.
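
The comparison itself is straightforward once a labeled corpus is available. Here is a sketch of the false-positive measurement, where the corpus loading and the detector under test (f8 or jchardet) are placeholders, not real APIs:

{code:java}
// Hypothetical evaluation sketch: measures UTF-8 false positives for a
// detector over a corpus of (bytes, trueCharset) pairs. The Sample corpus
// and the claimsUtf8 predicate are placeholders to be wired up to f8/jchardet.
import java.util.List;
import java.util.function.Predicate;

public final class Utf8Eval {

    public record Sample(byte[] bytes, String trueCharset) {}

    /** Fraction of non-UTF-8 samples that the detector wrongly flags as UTF-8. */
    public static double falsePositiveRate(List<Sample> corpus,
                                           Predicate<byte[]> claimsUtf8) {
        long nonUtf8 = 0, falsePositives = 0;
        for (Sample s : corpus) {
            if (!"UTF-8".equalsIgnoreCase(s.trueCharset())) {
                nonUtf8++;
                if (claimsUtf8.test(s.bytes())) {
                    falsePositives++;
                }
            }
        }
        return nonUtf8 == 0 ? 0.0 : (double) falsePositives / nonUtf8;
    }
}
{code}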

I'll take care of this next week... right now I'm on holiday and typing on my mobile phone!


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents, as it does for other plain text documents. But the accuracy of encoding detector tools, including icu4j, when dealing with HTML documents is meaningfully lower than with other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would also help them become more accurate.
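
For context on the current behavior described above, this is a minimal sketch of how ICU4J's detector (which Tika wraps as org.apache.tika.parser.txt.CharsetDetector) is typically driven; the class and method names around it are illustrative:

{code:java}
// Minimal sketch of the ICU4J detection path that Tika wraps. The wrapper
// class IcuDetectExample and its handling of a null match are illustrative.
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public final class IcuDetectExample {

    /** Returns the name of the best-matching charset, or null if none found. */
    public static String detectCharset(byte[] rawHtml) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(rawHtml);              // feed the raw bytes
        CharsetMatch match = detector.detect(); // best match by confidence
        return match == null ? null : match.getName();
    }
}
{code}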



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)