Posted to dev@tika.apache.org by "Gerard Bouchar (JIRA)" <ji...@apache.org> on 2018/07/13 09:29:00 UTC

[jira] [Comment Edited] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

    [ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541307#comment-16541307 ] 

Gerard Bouchar edited comment on TIKA-2673 at 7/13/18 9:28 AM:
---------------------------------------------------------------

[~tallison@apache.org]: Great, thank you very much! Of course I agree to it being merged. I'm sorry for forgetting the license header in the first place.

I have done more work on this over the last few days, and I am going to open a pull request with my latest changes.

We have conducted internal testing on this and have seen great results. We selected a random subset of ~100 000 URLs from a Nutch segment, fetched them once in Nutch, and parsed them using different strategies. We also fetched the same URLs using Puppeteer (headless Chrome) and compared the detected charsets. Here are the results:

 

{{                 correct similar wrong}}
{{standard           99.4%    0.0%  0.6%}}
{{standard_noparse   94.7%    4.6%  0.6%}}
{{default            85.9%   11.5%  2.6%}}
{{icu                79.1%   13.9%  7.0%}}

 

 

!image-2018-07-13-11-28-16-657.png!

standard_noparse is a composite detector: a version of my detector that only takes into account the BOM and HTTP headers, chained with the existing HtmlEncodingDetector, chained with Icu4JEncodingDetector.

standard is a composite detector: the latest version of my detector, chained with Icu4JEncodingDetector.
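
As a rough illustration of what "chained" means for these composite detectors, here is a minimal Java sketch. It is not Tika code: CharsetGuesser, BomGuesser, HttpHeaderGuesser and ChainedGuesser are hypothetical names made up for this example, and checking the BOM before the HTTP header is an assumption of the sketch. The only idea it shows is that each detector either answers or passes, and the first answer in the chain wins.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for an encoding detector; NOT Tika's EncodingDetector interface.
interface CharsetGuesser {
    // Returns a charset if this detector can decide, otherwise Optional.empty().
    Optional<Charset> guess(byte[] head, String httpContentType);
}

// Looks for a UTF-8 or UTF-16 byte order mark at the start of the input.
class BomGuesser implements CharsetGuesser {
    public Optional<Charset> guess(byte[] b, String httpContentType) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }
}

// Reads the charset parameter of an HTTP Content-Type header value, if present.
class HttpHeaderGuesser implements CharsetGuesser {
    public Optional<Charset> guess(byte[] b, String httpContentType) {
        if (httpContentType == null) {
            return Optional.empty();
        }
        for (String part : httpContentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    return Optional.empty(); // malformed or unsupported label: pass
                }
            }
        }
        return Optional.empty();
    }
}

// Tries each detector in order and returns the first answer.
class ChainedGuesser implements CharsetGuesser {
    private final List<CharsetGuesser> chain;

    ChainedGuesser(List<CharsetGuesser> chain) {
        this.chain = chain;
    }

    public Optional<Charset> guess(byte[] b, String httpContentType) {
        for (CharsetGuesser g : chain) {
            Optional<Charset> result = g.guess(b, httpContentType);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

In this picture, standard_noparse would correspond to a chain of three links and standard to a chain of two, with a statistical detector as the last link so that the chain always ends with some answer.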

Pages labeled "correct" are those for which Chrome and Tika detected the same charset. "similar" means that the detected charset, although incorrect, is close to the one detected by Chrome (ISO-8859-1 instead of WINDOWS-1254, for instance). "wrong" means that the detected charset was not close to the one detected by Chrome.
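
The labeling rule can be sketched roughly as follows. This is only an illustration of the three labels, not the script used for the test: CharsetComparison is a hypothetical name, and the set of charsets treated as "close" is an assumption built around the single example given above (ISO-8859-1 versus WINDOWS-1254).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CharsetComparison {
    enum Verdict { CORRECT, SIMILAR, WRONG }

    // ASSUMPTION: an illustrative family of Latin single-byte charsets.
    // The real notion of "close" used in the test is not specified here.
    private static final Set<String> LATIN_FAMILY = new HashSet<>(Arrays.asList(
            "iso-8859-1", "windows-1252", "iso-8859-9", "windows-1254"));

    // Compares the charset reported by Chrome (the reference) with the one detected by Tika.
    static Verdict compare(String chromeCharset, String tikaCharset) {
        String expected = chromeCharset.toLowerCase(Locale.ROOT);
        String detected = tikaCharset.toLowerCase(Locale.ROOT);
        if (expected.equals(detected)) {
            return Verdict.CORRECT;
        }
        if (LATIN_FAMILY.contains(expected) && LATIN_FAMILY.contains(detected)) {
            return Verdict.SIMILAR;
        }
        return Verdict.WRONG;
    }
}
```

With these assumptions, `CharsetComparison.compare("WINDOWS-1254", "ISO-8859-1")` yields SIMILAR, while a UTF-8 page detected as ISO-8859-1 yields WRONG.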


> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java, StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671, but it concerns the bytes-based detection itself rather than metadata.
> While reading the specification, I collected a list of sample cases where HtmlEncodingDetector deviates from the specification and thus fails to detect the right charset.
> I am attaching the test cases to this issue: 


