
Posted to user@tika.apache.org by Julia Ruzicka <ju...@simutech.at> on 2021/02/01 13:39:40 UTC

WG: Detecting multiple languages in a long text

Hello everyone!

 

I'm using Tika 1.25 to detect the language of a long text that I read from a
PDF (using PDFBox 2.0.22):

 

LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();
List<LanguageResult> languages = detector.detectAll(text);

 

The text is about 400 pages and most of it is in English, with a couple of
pages in French, a few paragraphs in Greek and a couple of Arabic and German
sentences.

I know that language detection needs a long-ish text sample to work, so
I'm fine with the short Arabic/German sentences not being detected. When I
run the code above with just a short sample in French or Greek, the detector
finds the right language, but if I use the whole text as input, the result
is:

en (0.9999969) = English with a 99.99969% probability

 

It doesn't list the other languages.

 

If I give the detector a mixed sample, it only detects both languages if
each makes up about the same amount of text.

If one part, e.g. in French, is 5 lines of text (~65 words) and the second,
e.g. in Greek, is 7 lines of text (~80 words), the result is:

el (0.99999815) = Greek

 

With 55 words in French and 45 words in Greek the result is:

fr (0.5714264)

el (0.4285709)

 

I also tried to do it the alternative way:

 

detector.setMixedLanguages(true);
detector.addText(text);
List<LanguageResult> languages = detector.detectAll();

 

This also lists only a single language, both for the full text and for my
first French-Greek text sample.

 

How do I get the other languages (in my case: French & Greek) as a result
too?

Re: Detecting multiple languages in a long text

Posted by Ken Krugler <kk...@transpac.com>.

Hi Julia,

So the goal is to have detection results show some non-zero probability for the other languages, right?

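In general, doing this for long runs of text is almost impossible using probabilistic models: the dominant language supplies nearly all of the evidence, so the scores for the minority languages end up close to zero.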

What you need to do is break the text up into some smaller units (by page or even better by paragraph, for example) and then do detection separately on each chunk of text.
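
A minimal sketch of that chunking step, in case it helps. It splits on blank lines, skips paragraphs that are too short to detect reliably, and counts paragraphs per detected language. The class name, the method names and the minChars cutoff are all made up for illustration, and the detector is passed in as a plain function so the sketch does not depend on any particular Tika version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Tika.
class ChunkedLanguageDetection {

    /** Splits text into paragraphs on blank lines, dropping empty pieces. */
    public static List<String> splitIntoParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        // Two line breaks, optionally separated by whitespace, end a paragraph.
        for (String piece : text.split("\\R\\s*\\R")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    /**
     * Runs the given detector on each paragraph separately and counts how many
     * paragraphs were assigned to each language code. Paragraphs shorter than
     * minChars are skipped, since short samples tend to be detected unreliably.
     */
    public static Map<String, Integer> countLanguages(
            String text, int minChars, Function<String, String> detect) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String paragraph : splitIntoParagraphs(text)) {
            if (paragraph.length() < minChars) {
                continue;
            }
            counts.merge(detect.apply(paragraph), 1, Integer::sum);
        }
        return counts;
    }
}
```

With the detector from your mail, the function argument could be something like p -> detector.detect(p).getLanguage(). I haven't run that against 1.25, so treat it as an assumption. Text extracted from a PDF often has no clean blank lines between paragraphs, in which case splitting by page may be the easier starting point.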

Then, based on those results, you can decide how you want to report the document's actual language content, which isn't straightforward.

E.g. what if only one paragraph (out of many) had a 10% chance of being Greek, because it contained one sentence in Greek, but everything else was English? Would you want to report the total document as English, or English with some Greek, or something else?
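
One possible policy, purely as a sketch: weight each chunk's top language by the chunk's length and report only the languages whose share of the total characters reaches some cutoff. The names LanguageShareReport, ChunkResult and minShare are invented for this example, and the cutoff is an arbitrary choice you would have to tune for your documents.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika.
class LanguageShareReport {

    /** Top language detected for one chunk, plus the chunk's length in characters. */
    public static final class ChunkResult {
        final String language;
        final int length;

        public ChunkResult(String language, int length) {
            this.language = language;
            this.length = length;
        }
    }

    /**
     * Returns each language's share of the total characters, keeping only
     * languages whose share is at least minShare (e.g. 0.05 for 5%).
     */
    public static Map<String, Double> languageShares(
            List<ChunkResult> results, double minShare) {
        Map<String, Integer> charsPerLanguage = new LinkedHashMap<>();
        long total = 0;
        for (ChunkResult r : results) {
            charsPerLanguage.merge(r.language, r.length, Integer::sum);
            total += r.length;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        if (total == 0) {
            return shares; // nothing was detected, so nothing to report
        }
        for (Map.Entry<String, Integer> e : charsPerLanguage.entrySet()) {
            double share = (double) e.getValue() / total;
            if (share >= minShare) {
                shares.put(e.getKey(), share);
            }
        }
        return shares;
    }
}
```

For example, with chunks totalling 9000 characters of English, 800 of French and 200 of Greek and a 5% cutoff, this reports English and French and drops Greek. Whether dropping it is the right call is exactly the judgment described above; a lower cutoff, or a rule based on chunk counts instead of characters, would give a different answer.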

Regards,

— Ken


--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr