Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2019/08/30 12:50:00 UTC

[jira] [Updated] (TIKA-2933) Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

     [ https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2933:
------------------------------
    Description: 
Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

I'm finally getting around to running the comparisons between our legacy HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More analysis is required, but the newer one is generally better*.  One area for improvement/explanation, though, is the "replacement" encoding.

* Text extracted with the StandardHtmlEncodingDetector contains 1 million more "common words" than text extracted with only our legacy detector.  The legacy extracts contain 133M common words, so that's an improvement of less than 1%.

  was:
Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

I'm finally getting around to running the comparisons between our legacy HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More analysis is required, but the newer one is, generally, much better.  One area for improvement/explanation, though, is the "replacement" encoding.


> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More analysis is required, but the newer one is generally better*.  One area for improvement/explanation, though, is the "replacement" encoding.
> * Text extracted with the StandardHtmlEncodingDetector contains 1 million more "common words" than text extracted with only our legacy detector.  The legacy extracts contain 133M common words, so that's an improvement of less than 1%.
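
For readers unfamiliar with the "replacement" encoding mentioned above: the WHATWG Encoding Standard maps a handful of legacy charset labels to a special "replacement" encoding, whose decoder emits a single U+FFFD and then stops. The following is a minimal sketch of that label mapping only. The class and method names (ReplacementLabels, isReplacement) are hypothetical, chosen for illustration; they are not Tika API and do not claim to reflect how StandardHtmlEncodingDetector implements this.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tika.
final class ReplacementLabels {

    // Labels that the WHATWG Encoding Standard assigns to the "replacement" encoding.
    private static final Set<String> LABELS = Set.of(
            "csiso2022kr",
            "hz-gb-2312",
            "iso-2022-cn",
            "iso-2022-cn-ext",
            "iso-2022-kr",
            "replacement");

    // Returns true if a declared charset label maps to the "replacement" encoding.
    // trim() is used here as an approximation of the standard's
    // "strip leading and trailing ASCII whitespace" step.
    static boolean isReplacement(String label) {
        if (label == null) {
            return false;
        }
        String normalized = label.trim().toLowerCase(Locale.ROOT);
        return LABELS.contains(normalized);
    }

    public static void main(String[] args) {
        for (String label : new String[] {"ISO-2022-KR", " hz-gb-2312 ", "utf-8"}) {
            System.out.println(label.trim() + " -> " + isReplacement(label));
        }
    }
}
```

A page that declares one of these labels would, under a strict reading of the standard, decode to a single replacement character, which is presumably why the mapping deserves a second look for text extraction.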



--
This message was sent by Atlassian Jira
(v8.3.2#803003)