Posted to dev@tika.apache.org by Mike Thomsen <mi...@gmail.com> on 2019/01/17 17:39:09 UTC

Chinese and Korean being detected as Lithuanian by LanguageDetector

I wrote a Groovy script (attached) to test a bunch of languages against the
LanguageDetector class. These were the results, with the language tested on
the left and the language detected on the right:

ar    fa
de    de
en    en
es    es
fr    fr
gr    el
it    it
ko    lt
nl    nl
ru    ru
zh    lt

Is there something that needs to be done to enable the detection of Asian
languages or should I file this as a bug report?

Thanks,

Mike
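
The table above can also be checked mechanically. The following Java sketch is not the attached Groovy script; it only re-parses the eleven reported rows and lists those where the two codes differ. The names DetectionReport and mismatches are hypothetical, and the sketch assumes the left column is the language tested and the right column is what the detector returned.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: list the rows of the reported table whose two codes differ. */
final class DetectionReport {

    // The results exactly as reported in the post (tested code, detected code).
    static final String RESULTS = String.join("\n",
            "ar    fa",
            "de    de",
            "en    en",
            "es    es",
            "fr    fr",
            "gr    el",
            "it    it",
            "ko    lt",
            "nl    nl",
            "ru    ru",
            "zh    lt");

    /**
     * Returns "tested -> detected" for every row where the codes are not
     * identical strings. Blank or malformed lines are skipped.
     */
    static List<String> mismatches(String table) {
        List<String> out = new ArrayList<>();
        for (String line : table.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length != 2) {
                continue; // not a two-column row
            }
            if (!cols[0].equals(cols[1])) {
                out.add(cols[0] + " -> " + cols[1]);
            }
        }
        return out;
    }
}
```

Strict string comparison flags four rows: ar, gr, ko and zh. The gr row is a naming difference rather than a misdetection, since the ISO 639-1 code for Greek is el.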

Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

Posted by Ken Krugler <kk...@transpac.com>.

Hi Mike,

So the issues are Arabic, Korean and Chinese, right?

I’d suggest filing an issue for Tika so we can at least track it, though the problem is likely in the language-detector project we’re using for detection.

I’m leaving on a trip this evening, but back next week, so will try to look at it then.

Regards,

— Ken


> On Jan 17, 2019, at 1:48 PM, Mike Thomsen <mi...@gmail.com> wrote:
> 
> Ken,
> 
> Here's a Gist version of it:
> 
> https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b
> 
> Thanks,
> 
> Mike

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

Posted by Mike Thomsen <mi...@gmail.com>.

Ken,

Here's a Gist version of it:

https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b

Thanks,

Mike

On Thu, Jan 17, 2019 at 2:25 PM Ken Krugler <kk...@transpac.com>
wrote:

> Hi Mike,
>
> I don’t see the script - did it get stripped?
>
> Below is a list of the language profiles that I believe are bundled with
> the language-detector jar that’s pulled in by Tika.
>
> I don’t see “gr” - note that Greek is “el”.
>
> And there’s “zh-CN” and “zh-TW” vs. just “zh”, but otherwise I’d expect
> detection to work for your test cases.
>
> — Ken

Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

Posted by Ken Krugler <kk...@transpac.com>.

Hi Mike,

I don’t see the script - did it get stripped?

Below is a list of the language profiles that I believe are bundled with the language-detector jar that’s pulled in by Tika.

I don’t see “gr” - note that Greek is “el”.

And there’s “zh-CN” and “zh-TW” vs. just “zh”, but otherwise I’d expect detection to work for your test cases.

— Ken

af
an
ar
ast
be
bg
bn
br
ca
cs
cy
da
de
el
en
es
et
eu
fa
fi
fr
ga
gl
gu
he
hi
hr
ht
hu
id
is
it
ja
km
kn
ko
lt
lv
mk
ml
mr
ms
mt
ne
nl
no
oc
pa
pl
pt
ro
ru
sk
sl
so
sq
sr
sv
sw
ta
te
th
tl
tr
uk
ur
vi
yi
zh-CN
zh-TW
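
To separate naming differences from real misdetections, a tested code can be looked up in the profile list above. This is a minimal Java sketch, assuming the list is complete as quoted; ProfileLookup and candidates are hypothetical names and are not part of Tika or the language-detector project. It returns the listed profiles that match a code exactly or extend it with a region subtag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch: look up a language code in the profile list quoted in this message. */
final class ProfileLookup {

    // Profile codes copied from the list above; a TreeSet keeps them sorted.
    static final SortedSet<String> PROFILES = new TreeSet<>(Arrays.asList(
            "af", "an", "ar", "ast", "be", "bg", "bn", "br", "ca", "cs",
            "cy", "da", "de", "el", "en", "es", "et", "eu", "fa", "fi",
            "fr", "ga", "gl", "gu", "he", "hi", "hr", "ht", "hu", "id",
            "is", "it", "ja", "km", "kn", "ko", "lt", "lv", "mk", "ml",
            "mr", "ms", "mt", "ne", "nl", "no", "oc", "pa", "pl", "pt",
            "ro", "ru", "sk", "sl", "so", "sq", "sr", "sv", "sw", "ta",
            "te", "th", "tl", "tr", "uk", "ur", "vi", "yi", "zh-CN", "zh-TW"));

    /**
     * Returns the profiles that equal the given code or start with the code
     * followed by "-" (for example "zh" matches "zh-CN"), in sorted order.
     */
    static List<String> candidates(String code) {
        List<String> out = new ArrayList<>();
        for (String profile : PROFILES) {
            if (profile.equals(code) || profile.startsWith(code + "-")) {
                out.add(profile);
            }
        }
        return out;
    }
}
```

With this list, candidates("zh") yields zh-CN and zh-TW, and candidates("gr") yields nothing. candidates("ar") and candidates("ko") each find an exact profile, so the ar and ko results cannot be explained by a code mismatch.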


> On Jan 17, 2019, at 9:39 AM, Mike Thomsen <mi...@gmail.com> wrote:
> 
> I wrote a Groovy script (attached) to test a bunch of languages against the LanguageDetector class, and these were the results:
> 
> ar    fa
> de    de
> en    en
> es    es
> fr    fr
> gr    el
> it    it
> ko    lt
> nl    nl
> ru    ru
> zh    lt
> 
> Is there something that needs to be done to enable the detection of Asian languages or should I file this as a bug report?
> 
> Thanks,
> 
> Mike
