Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/02/02 13:07:51 UTC

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849891#comment-15849891 ] 

Tim Allison commented on TIKA-2038:
-----------------------------------

bq. The overall accuracy of Tika, i.e. 72%, is less than the accuracy of JUniversalCharDet, which is 74%!! It's an odd phenomenon because JUniversalCharDet is a sub-component of Tika. I think this is due to the way Tika uses JUniversalCharDet; that is, a kind of early termination in data feeding … listener.handleData(b, 0, m);
In contrast, in this comparison I used a feed-all approach as follows …
detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);

Yes, this makes sense.  The other potential cause is that if an HTML page misidentifies its encoding via a meta header, Tika will rely on that declaration without running the other detectors.
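For concreteness, a minimal sketch of the feed-all approach with juniversalchardet's UniversalDetector (the detectCharset wrapper and buffer name are illustrative, not Tika code):

    import org.mozilla.universalchardet.UniversalDetector;

    public class FeedAllDetection {
        // Feed the whole raw byte sequence before asking for a verdict,
        // instead of terminating early after the first chunk.
        public static String detectCharset(byte[] rawHtmlByteSequence) {
            UniversalDetector detector = new UniversalDetector(null);
            detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);
            detector.dataEnd(); // required so the detector finalizes its guess
            String charset = detector.getDetectedCharset(); // may be null if undetected
            detector.reset();
            return charset;
        }
    }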


bq. The ASF's Jira doesn't allow uploading files greater than 19.54 MB.

Right.  On further thought, I would like to build a smallish corpus from Common Crawl for this purpose.  If we did random sampling by URL country-code TLD (.iq, .kr, etc.) for the countries you've identified, would that meet our needs?
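As a rough sketch of what that sampling could look like, assuming a plain list of candidate URLs already pulled from the Common Crawl index (all class and method names below are illustrative):

    import java.net.URI;
    import java.util.*;
    import java.util.stream.Collectors;

    public class TldSampler {

        // Group candidate URLs by country-code TLD and draw a fixed-size random
        // sample per TLD (.iq, .kr, ...) so each target language is represented.
        public static Map<String, List<String>> sampleByTld(
                List<String> urls, Set<String> wantedTlds, int perTld, long seed) {
            Map<String, List<String>> byTld = urls.stream()
                    .filter(u -> wantedTlds.contains(tldOf(u)))
                    .collect(Collectors.groupingBy(TldSampler::tldOf));
            Random rnd = new Random(seed); // fixed seed keeps the sample reproducible
            Map<String, List<String>> sample = new HashMap<>();
            for (Map.Entry<String, List<String>> e : byTld.entrySet()) {
                List<String> list = new ArrayList<>(e.getValue());
                Collections.shuffle(list, rnd);
                sample.put(e.getKey(), list.subList(0, Math.min(perTld, list.size())));
            }
            return sample;
        }

        private static String tldOf(String url) {
            try {
                String host = new URI(url).getHost();
                if (host == null) return "";
                return host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
            } catch (java.net.URISyntaxException ex) {
                return ""; // skip malformed URLs
            }
        }
    }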


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other natural-text documents. But the accuracy of encoding detector tools, including icu4j, on HTML documents is meaningfully lower than on other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would help them become more accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)