Posted to user@tika.apache.org by Katsuya Tomioka <ka...@gmail.com> on 2019/11/12 18:48:58 UTC

Encoding detectors in OSGi (tika-bundle)

I'm having trouble accessing encoding detectors in OSGi with Tika 1.22. AutoDetectParser returns "Failed to detect the character encoding of a document" for non-Latin text. We are migrating from 1.10, so I'm sure many things are different. My problem seems to be that, while all the detectors are in tika-parser, the code is loading them from tika-core. I see that parsers and detectors are tracked as services. Do I need to do something similar to load encoding detectors as well?

Thanks,

-Katsuya

Re: Encoding detectors in OSGi (tika-bundle)

Posted by Katsuya Tomioka <ka...@gmail.com>.
My current approach is to set ServiceLoader's context class loader, which seems to be working. It's a bit awkward, but I'm doing something like this:
    ServiceLoader.setContextClassLoader(Icu4jEncodingDetector.class.getClassLoader());
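
To illustrate why the choice of class loader matters here, the following is a minimal sketch that uses the JDK's java.util.ServiceLoader rather than Tika's own ServiceLoader, so that it runs without Tika on the classpath. That substitution is an assumption made for illustration, and the names LoaderVisibilitySketch and CharsetGuesser are hypothetical. It only shows that a service lookup finds the providers visible through the class loader it is given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class LoaderVisibilitySketch {

    /** Hypothetical service interface standing in for an encoding detector. */
    public interface CharsetGuesser {
        String guess(byte[] data);
    }

    /**
     * Returns the class names of all CharsetGuesser providers that are
     * visible through the given class loader. Providers are discovered via
     * META-INF/services files that this loader can see, so a loader that
     * cannot see those files yields an empty list.
     */
    public static List<String> visibleProviders(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (CharsetGuesser guesser : ServiceLoader.load(CharsetGuesser.class, loader)) {
            names.add(guesser.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // No provider is registered for the hypothetical interface in this
        // sketch, so the lookup through this class's own loader finds nothing.
        ClassLoader own = LoaderVisibilitySketch.class.getClassLoader();
        System.out.println("providers visible: " + visibleProviders(own));
    }
}
```

In an OSGi container each bundle normally has its own class loader, so a lookup made through one bundle's loader would not be expected to see provider files packaged in another bundle. Passing the loader of a class that lives in the providing bundle, as in the line above, is one way to work around that.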


On 2019/11/13 06:44:02, Nick Burch <ap...@gagravarr.org> wrote: 
> On Tue, 12 Nov 2019, Katsuya Tomioka wrote:
> > I'm having trouble accessing encoding detectors in OSGi with Tika 1.22. 
> > AutoDetectParser returns "Failed to detect the character encoding of a 
> > document" for non-Latin text. We are migrating from 1.10, I'm sure many 
> > things are different. It seems like my problem is while all the 
> > detectors are in tika-parser, the code is loading from tika-core's. I 
> > see parsers and detectors are tracked as services. Do I need to do 
> > something similar to load encoding detectors as well?
> 
> The things which are currently loaded via services are:
>   * Parsers
>   * Detectors (file type)
>   * Translators
>   * Encoding Detection
>   * Language Detection
>   * Probability-based type detectors
> 
> I think there might be helpers to assist with those, hopefully one of our 
> OSGi experts will be along shortly to advise!
> 
> Nick
> 

Re: Encoding detectors in OSGi (tika-bundle)

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 12 Nov 2019, Katsuya Tomioka wrote:
> I'm having trouble accessing encoding detectors in OSGi with Tika 1.22. 
> AutoDetectParser returns "Failed to detect the character encoding of a 
> document" for non-Latin text. We are migrating from 1.10, I'm sure many 
> things are different. It seems like my problem is while all the 
> detectors are in tika-parser, the code is loading from tika-core's. I 
> see parsers and detectors are tracked as services. Do I need to do 
> something similar to load encoding detectors as well?

The things which are currently loaded via services are:
  * Parsers
  * Detectors (file type)
  * Translators
  * Encoding Detection
  * Language Detection
  * Probability-based type detectors

I think there might be helpers to assist with those. Hopefully one of our 
OSGi experts will be along shortly to advise!

Nick