Posted to dev@tika.apache.org by "Shabanali Faghani (JIRA)" <ji...@apache.org> on 2018/12/02 19:57:00 UTC

[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706453#comment-16706453 ] 

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

[~HansBrende] you were right!

Besides your tests, I've tested f8 and jchardet over the [^iust_encodings.zip] corpus and found that the accuracy of f8 and jchardet in detecting UTF-8 is practically identical. Moreover, when I used your proposed code ...
{code:java}
org.rypt.f8.Utf8Statistics stats = new org.rypt.f8.Utf8Statistics();
stats.write(rawHtmlByteSequence); // feed all of the raw HTML bytes to the validator
if (stats.countInvalid() == 0) {  // no invalid UTF-8 sequence was seen
    return "UTF-8";
}
{code}
f8 was ~10x faster than jchardet. However, when I used this method from [f8's GitHub site|https://github.com/HansBrende/f8#check-if-an-inputstream-is-100-valid-utf-8] ...
{code:java}
public static boolean isValidUtf8(InputStream is) throws IOException {
    int state = 0;
    int b;
    while ((b = is.read()) != -1) {
        state = Utf8.nextState(state, (byte)b);
        if (Utf8.isErrorState(state)) {
            return false;
        }
    }
    return state >= 0; //Or return true if stream was truncated
}
{code}
with the same accuracy, f8 was ~30x faster than jchardet on my machine! (These numbers may vary from machine to machine due to differences in system load, HotSpot warm-up, the corpus, etc.) Since in IUST we don't need the statistics, I much prefer the latter.
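
Just to make it concrete, the flow I have in mind for IUST is roughly the following sketch (the _detectWithJchardet_ method here is only a placeholder for the existing jchardet-based logic, not real code):
{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.rypt.f8.Utf8;

public class CharsetDetectionSketch {

    // Same strict check as the isValidUtf8 method quoted above from f8's README
    static boolean isValidUtf8(InputStream is) throws IOException {
        int state = 0;
        int b;
        while ((b = is.read()) != -1) {
            state = Utf8.nextState(state, (byte) b);
            if (Utf8.isErrorState(state)) {
                return false;
            }
        }
        return state >= 0;
    }

    // Trust the strict UTF-8 check first; fall back to the current
    // jchardet-based path only when the bytes are not valid UTF-8.
    static String detectCharset(byte[] rawHtml) throws IOException {
        if (isValidUtf8(new ByteArrayInputStream(rawHtml))) {
            return "UTF-8";
        }
        return detectWithJchardet(rawHtml);
    }

    // Placeholder only: stands for the existing jchardet logic in IUST
    private static String detectWithJchardet(byte[] rawHtml) {
        throw new UnsupportedOperationException("existing jchardet-based detection goes here");
    }
}
{code}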

In real-world conditions, however, when we use the second method I guess f8 will be somewhere between ~10x and ~30x faster than the way I used jchardet in IUST. That's because the above _while_ loop runs to completion for almost all UTF-8 documents, and while UTF-8 makes up [almost 92%|https://w3techs.com/technologies/history_overview/character_encoding] of the web, its share in the [^iust_encodings.zip] corpus is just 25%.
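
Just to illustrate the early-exit point with toy inputs (not taken from the corpus): a non-UTF-8 document is rejected as soon as the first invalid byte shows up, while a genuine UTF-8 document has to be scanned to the very end, which is why the corpus mix matters so much:
{code:java}
import java.nio.charset.StandardCharsets;

import org.rypt.f8.Utf8;

public class EarlyExitExample {

    // Counts how many bytes the validity check has to look at before it can stop.
    static int bytesExamined(byte[] data) {
        int state = 0;
        for (int i = 0; i < data.length; i++) {
            state = Utf8.nextState(state, data[i]);
            if (Utf8.isErrorState(state)) {
                return i + 1; // stopped at the first invalid byte
            }
        }
        return data.length; // valid UTF-8: every single byte was scanned
    }

    public static void main(String[] args) {
        byte[] utf8 = "<html>سلام UTF-8</html>".getBytes(StandardCharsets.UTF_8);
        // 0xFF can never occur in UTF-8, so the check fails right there
        byte[] notUtf8 = {0x3C, 0x68, 0x74, 0x6D, 0x6C, 0x3E, (byte) 0xFF, 0x20, 0x20};

        System.out.println(bytesExamined(utf8));    // = utf8.length, i.e. a full scan
        System.out.println(bytesExamined(notUtf8)); // = 7, stops at the 0xFF byte
    }
}
{code}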

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other plain-text documents. But the accuracy of encoding detector tools, including icu4j, on HTML documents is meaningfully lower than on other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)