Posted to dev@tika.apache.org by "Shabanali Faghani (JIRA)" <ji...@apache.org> on 2017/03/04 18:32:45 UTC

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887801#comment-15887801 ] 

Shabanali Faghani edited comment on TIKA-2038 at 3/4/17 6:31 PM:
-----------------------------------------------------------------

Perfect reply, [~tallison@mitre.org]. Thank you!
 
bq. The current version of the stripper leaves in <meta > headers if they also include "charset". … I included the output of the stripped HTMLMeta detector as a sanity check … (/)
 
bq. I figure that we'll be modifying the stripper …
 
We might need the stripper to work like a SAX parser, i.e. the input should be an _InputStream_. This is required if we decide to be conservative about OOM errors or to avoid wasting resources on big HTML files. I know that writing a perfect _HTML stream stripper_ with minimal faults (false negatives/positives, exceptions, …) is very hard. As a SAX parser, TagSoup should be able to do this, but there are two problems: _chicken and egg_ (the parser must decode the bytes before it can strip them, yet the charset is exactly what we are trying to detect) and _performance_. The former can be solved by the _ISO-8859-1 encoding-decoding_ trick, but there is no solution for the latter.
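
To make the trick concrete, here is a minimal sketch (my own illustration, not Tika or IUST code; the regex stripper is only a stand-in for a real stream stripper such as TagSoup):

{code:java}
import java.nio.charset.StandardCharsets;

// Sketch of the ISO-8859-1 encoding-decoding trick. ISO-8859-1 maps every
// byte 0x00-0xFF to exactly one char, so decode + re-encode with it is
// lossless and needs no knowledge of the document's real charset.
public class Iso88591Trick {

    public static byte[] stripMarkup(byte[] rawHtml) {
        // Lossless byte -> char mapping; the real charset is still unknown.
        String pseudoText = new String(rawHtml, StandardCharsets.ISO_8859_1);

        // Stand-in stripper: a production version would be a SAX/stream
        // stripper; regexes are only good enough for a sketch.
        String visibleText = pseudoText
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1\\s*>", " ")
                .replaceAll("(?s)<[^>]*>", " ");

        // Lossless char -> byte mapping: the surviving text keeps its
        // original bytes, ready for JCharDet/ICU4J detection.
        return visibleText.getBytes(StandardCharsets.ISO_8859_1);
    }
}
{code}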

For a lightweight SAX-style stripper, I think we can ask [Jonathan Hedley|https://jhy.io/], the author of Jsoup, or someone else on Jsoup’s mailing list whether they have ever done something like this or could help us. We might also suggest/introduce IUST (the standalone version) to them. IIRC, in Jsoup 1.6.1-3 (and most likely still today) the charset of a page was assumed to be UTF-8 if the HTTP header didn’t contain a charset and no charset was specified in the input.
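
For reference, that fallback can be seen with a few lines against the public Jsoup API (a sketch, checked against recent Jsoup versions rather than 1.6.x):

{code:java}
import java.io.FileInputStream;
import java.io.InputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// A null charset name asks Jsoup to sniff the charset itself (BOM, then
// <meta> tags) and to fall back to UTF-8 when nothing is found.
public class JsoupCharsetFallback {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            Document doc = Jsoup.parse(in, null, "");
            System.out.println(doc.charset()); // the charset Jsoup settled on
        }
    }
}
{code}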
 
bq. … and possibly IUST.
 
The current version of IUST, i.e. htmlchardet-1.0.1, uses _early-termination_ for neither JCharDet nor ICU4J! So, we would have to write a custom version of IUST to do so. Nevertheless, I think we can ignore this for the first version, because it shouldn’t have a meaningful effect on the algorithm. In fact, I think calling the detection methods of JCharDet and ICU4J with an InputStream input would slightly increase efficiency at the cost of a slight decrease in accuracy.
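
To make the _early-termination_ idea concrete, here is a rough sketch (my illustration, not htmlchardet code, which instead feeds the detectors whole byte arrays) of how a custom version could stop reading the stream early via JCharDet’s streaming API:

{code:java}
import java.io.InputStream;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

// DoIt() returns true as soon as the detector is confident, so we can stop
// reading instead of buffering the whole document.
public class EarlyTerminationSketch {

    public static String detect(InputStream in) throws Exception {
        final String[] result = new String[1];
        nsDetector det = new nsDetector();
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                result[0] = charset;
            }
        });

        byte[] buf = new byte[4096];
        int len;
        boolean done = false;
        while (!done && (len = in.read(buf)) != -1) {
            done = det.DoIt(buf, len, false); // true => confident, stop early
        }
        det.DataEnd(); // flushes and fires Notify() if a charset was decided

        return result[0]; // may be null, e.g. if the input was pure ASCII
    }
}
{code}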
 
bq. I didn't use IUST because this was a preliminary run, and I wasn't sure which version I should use. The one on github or the proposed modification above or both? Let me know which code you'd like me to run.
 
The _modified IUST_ isn’t yet complete. To complete it we must prepare a thorough list of languages for which the stripping shouldn’t be done. These languages/tlds are determined by comparing the results of IUST with and without stripping. So, you should run both _htmlchardet-1.0.1.jar_ (IUST with stripping) with _lookInMeta=false_ and the class _IUSTWithoutMarkupElimination_ (IUST without stripping) from the [lang-wise-eval source code|https://issues.apache.org/jira/secure/attachment/12848364/lang-wise-eval_source_code.zip]. The accuracy of the _modified IUST_ (the pseudocode above) can then be computed algorithmically by selecting the better of the two for each language/tld, as in the sketch below.
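
A sketch of that selection step (hypothetical variable names, not from the eval code; assumes the per-tld accuracies from the two runs above have already been collected):

{code:java}
import java.util.HashMap;
import java.util.Map;

// The modified IUST's expected accuracy for each tld is the better of the
// with-/without-stripping runs; the tlds where stripping loses form the
// "don't strip" list.
public class BestOfTwoSelection {

    public static Map<String, Double> combine(Map<String, Double> withStripping,
                                              Map<String, Double> withoutStripping) {
        Map<String, Double> modifiedIust = new HashMap<>();
        for (Map.Entry<String, Double> e : withStripping.entrySet()) {
            String tld = e.getKey();
            double stripped = e.getValue();
            double unstripped = withoutStripping.getOrDefault(tld, 0.0);
            if (unstripped > stripped) {
                System.out.println("add to don't-strip list: " + tld);
            }
            modifiedIust.put(tld, Math.max(stripped, unstripped));
        }
        return modifiedIust;
    }
}
{code}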
 
bq. I want to focus on accuracy first. We still have to settle on an eval method. But, yes, I do want to look at this. (/)



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx, tld_text_html.xlsx
>
>
> Currently, Tika uses ICU4J for detecting the charset encoding of HTML documents as well as of other plain-text documents. But the accuracy of encoding detector tools, including ICU4J, in dealing with HTML documents is meaningfully lower than with other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would help them become more accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)