Posted to user@tika.apache.org by Brian Young <bw...@gmail.com> on 2017/09/22 14:27:39 UTC

CharsetDetector vs EncodingDetector

Hello,

We had code that was using CharsetDetector, and after upgrading to 1.16 it
now returns different answers than it did in older versions. After
digging in a bit, I noticed that AutoDetectReader uses EncodingDetector,
which seems to match my primary use case, so I am switching to that. I
surmise that I was likely using the wrong class or the wrong approach before.

What we are doing now is creating a throwaway AutoDetectReader and
grabbing the detected charset from it.

However, that leaves me wondering: how and where is CharsetDetector used? I've
been studying CharsetDetector and EncodingDetector and trying to find some
information on when I would use one vs. the other, and it isn't clear to me
yet.

Thank you,
Brian

RE: CharsetDetector vs EncodingDetector

Posted by "Allison, Timothy B." <ta...@mitre.org>.
CharsetDetector is our copy/paste of ICU4j’s encoding detector.  We wrap it as an EncodingDetector in our org.apache.tika.parser.txt.Icu4jEncodingDetector.

The AutoDetectReader loads 3 EncodingDetectors specified in the tika/parsers/resources/META-INF/services/o.a.t.detect.EncodingDetector service file:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

It runs through them in order and returns the first non-null result.
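
The ordering rule above can be sketched in plain Java. This is only an illustration of "first non-null wins", not Tika's actual code: SimpleDetector, FirstNonNullChain, and the two toy detectors are hypothetical names made up for this example, and they do not correspond to the three detectors listed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a detector (NOT Tika's EncodingDetector interface):
// returns a charset, or null when it has no opinion.
interface SimpleDetector {
    Charset detect(byte[] input);
}

class FirstNonNullChain {
    private final List<SimpleDetector> detectors;

    FirstNonNullChain(SimpleDetector... detectors) {
        this.detectors = Arrays.asList(detectors);
    }

    // Ask each detector in order; the first non-null answer is returned.
    Charset detect(byte[] input) {
        for (SimpleDetector detector : detectors) {
            Charset charset = detector.detect(input);
            if (charset != null) {
                return charset;
            }
        }
        return null; // no detector had an answer
    }

    // Toy detector: answers UTF-8 only when the input starts with a UTF-8 byte order mark.
    static SimpleDetector utf8Bom() {
        return input -> input.length >= 3
                && (input[0] & 0xFF) == 0xEF
                && (input[1] & 0xFF) == 0xBB
                && (input[2] & 0xFF) == 0xBF
                ? StandardCharsets.UTF_8 : null;
    }

    // Toy detector: always answers ISO-8859-1, so anything placed after it is never consulted.
    static SimpleDetector alwaysLatin1() {
        return input -> StandardCharsets.ISO_8859_1;
    }
}
```

With the BOM detector first, the chain can answer either UTF-8 or ISO-8859-1 depending on the input. With the always-answering detector first, the BOM detector is never reached. This is why the order matters.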

IIRC, we did a fresh copy/paste of ICU4j before 1.16, which would likely explain the different answers you are seeing.

You can modify the order of the encoding detectors or even which ones are used via tika-config.xml. See e.g.: https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-no-icu4j-encoding-detector.xml

In short you can experiment with each of the 3 to figure out which one works best and then determine the best order in which to apply them. 😊

If you have time and the interest, I’d run each of the 3 (or just 2 if you know you don’t have html) and then use tika-eval [1] to see which gives you higher “common words” scores (where “common words” is the count of words in your extracts that are in the top 20000 common words extracted from Wikipedia for the detected language).  You have the rare opportunity to be the 2nd person in the world to use tika-eval.
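
To make the “common words” idea concrete, here is a rough sketch in plain Java. It is not how tika-eval is implemented: CommonWordsScore, the tokenization rule, and the tiny word set in the usage note are assumptions for illustration only. The point is just that text decoded with the wrong charset tends to match fewer known words.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;

class CommonWordsScore {
    // Counts how many tokens of the text appear in the given common-word set.
    // Tokens are lowercased runs of letters; everything else is a separator.
    static int count(String text, Set<String> commonWords) {
        int hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && commonWords.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    // Decodes the raw bytes with a candidate charset and scores the result.
    static int score(byte[] raw, Charset candidate, Set<String> commonWords) {
        return count(new String(raw, candidate), commonWords);
    }

    // Picks between two candidates (UTF-8 and ISO-8859-1 here) by the higher score.
    static Charset better(byte[] raw, Set<String> commonWords) {
        int utf8 = score(raw, StandardCharsets.UTF_8, commonWords);
        int latin1 = score(raw, StandardCharsets.ISO_8859_1, commonWords);
        return utf8 >= latin1 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }
}
```

For example, with a word set of "the", "and", and "caf\u00e9", the UTF-8 bytes of "the caf\u00e9 and the th\u00e9" score higher when decoded as UTF-8 than as ISO-8859-1, because the mis-decoded accented word no longer matches.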

Oh, and once you’ve done that, you can chip in on TIKA-2038.

Cheers,

         Tim

[1] https://wiki.apache.org/tika/TikaEval
