Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/02/07 13:41:42 UTC

[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

     [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2038:
------------------------------
    Attachment: proposedTLDSampling.csv

I concatenated the TLDs from your initial eval (on GitHub) with the ones you mentioned in your last post, and I added a few others for kicks.

If the goal is to get ~30k per TLD, let's sample to obtain 50k, on the theory that duplicates and other failures will eat into the pull (even a 40% loss on 50k still nets 30k).

Any other TLDs or MIME types we should add?

SQL to calculate these sampling rates:
{noformat}
select tld, sum(n) as CountTextHTML,
     -- cap the rate at 1.0 for TLDs with fewer than 50k matching pages
     case
          when cast(50000 as float)/cast(sum(n) as float) > 1.0
          then 1.0
          else cast(50000 as float)/cast(sum(n) as float)
     end as SamplingRate
from mimes_by_tld
where tld in
('ae', 'af', 'cn', 'de', 'dz',
'eg', 'es', 'fr', 'gr', 'il',
'in', 'iq', 'ir', 'it', 'jo', 'jp',
'kp', 'kr', 'lb', 'pk', 'qa', 'ru',
'sa', 'sd', 'sy', 'tn', 'tr',
'tw', 'uk', 'us', 'vn', 'ye')
and
(mime ilike '%html%'
or mime ilike '%text%')
group by tld
{noformat} 
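
Once the rates are computed, the pull itself could be a simple Bernoulli filter. A rough, untested sketch, assuming the query above is materialized as a table named sampling_rates and that there's a row-level urls table with tld, mime, and url columns (all hypothetical names here), using Postgres's random():

{noformat}
-- sketch only (hypothetical table/column names): include each matching
-- row with probability equal to its TLD's SamplingRate
select u.url
from urls u
join sampling_rates r
     on r.tld = u.tld
where (u.mime ilike '%html%'
     or u.mime ilike '%text%')
and random() < r.SamplingRate
{noformat}

If we want the pull to be reproducible across runs, a deterministic hash of the url could stand in for random().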

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as of other natural-text documents. But the accuracy of encoding detection tools, including icu4j, is meaningfully lower on HTML documents than on other text documents. Hence, for our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and since these projects deal heavily with HTML documents, having such a facility in Tika should help them become more accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)