Posted to user@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2013/12/14 05:04:13 UTC

language detection in tika ...

 In section 7.2 (pg. 115) ... of "Tika in Action", the authors talk in
very general terms about this topic and mention that Tika currently
uses n-grams but may change the underlying algorithm in the future.

 Could you/the committers/the authors share a little more about Tika's
language detection internals and/or your probable future
decisions/plans?

 thanks
 lbrtchx

Re: language detection in tika ...

Posted by Nick Burch <ap...@gagravarr.org>.

On Sat, 14 Dec 2013, Albretch Mueller wrote:
> In section 7.2 (pg. 115) ... of "Tika in Action", the authors talk in
> very general terms about this topic and mention that Tika currently
> uses n-grams but may change the underlying algorithm in the future.

I think it's based on tri-grams, with some code originally from Nutch, but
I'm not certain. There has certainly been talk of using some more recent
code, quite possibly with a wider range of gram sizes (is that the right
term?), but it's not an area of the codebase I'm all that strong on.

Nick
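
For readers unfamiliar with the term, here is a minimal sketch of what
character tri-gram profiling generally looks like. It is an illustration
only, not the code in Tika or Nutch: the class name TrigramSketch and both
method names are hypothetical, and the overlap score is just one simple way
to compare two profiles.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; NOT Tika's or Nutch's implementation.
final class TrigramSketch {

    // Count every overlapping 3-character window in the lower-cased text.
    // Works on UTF-16 chars, so supplementary code points are not handled.
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String normalized = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Crude similarity: for each tri-gram present in both profiles,
    // add the smaller of the two counts.
    static int overlap(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                shared += Math.min(e.getValue(), other);
            }
        }
        return shared;
    }
}
```

A detector built this way would keep one reference profile per language and
pick the language whose profile scores best against the input text. Using a
wider range of gram sizes would mean repeating the counting loop for window
lengths other than 3.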

Re: language detection in tika ...

Posted by Albretch Mueller <lb...@gmail.com>.

 I meant to mention that my algorithm only works for alphabetic languages
(which are the ones that give a harder time anyway?), and one thing
that I wonder about regarding Tika's LanguageIdentifier

 tika.apache.org/1.2/api/org/apache/tika/language/LanguageIdentifier.html

 is that you don't see an:

 .isAlphabetic() {true, false}

 test as part of the API

 lbrtchx
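
As a rough illustration of the kind of test being asked about, the sketch
below checks whether most letters in a string are non-ideographic, using only
java.lang.Character. This is an assumption about what "alphabetic" might mean
here; the class ScriptSketch and the method isMostlyAlphabetic are
hypothetical and are not part of LanguageIdentifier or any Tika API.

```java
// Hypothetical helper; NOT part of Tika's LanguageIdentifier.
final class ScriptSketch {

    // Returns true when the text contains letters and fewer than half of
    // them are ideographs (as reported by Character.isIdeographic).
    static boolean isMostlyAlphabetic(String text) {
        int letters = 0;
        int ideographs = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            if (Character.isLetter(cp)) {
                letters++;
                if (Character.isIdeographic(cp)) {
                    ideographs++;
                }
            }
        }
        return letters > 0 && ideographs * 2 < letters;
    }
}
```

Note that this treats syllabaries such as kana the same as alphabets, so it
is only a coarse filter a caller could run before handing text to a detector.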

Re: language detection in tika ...

Posted by Albretch Mueller <lb...@gmail.com>.

 Give me about two weeks and I may have some good ideas (based on
mathematics/pattern recognition), which I could even implement.

 lbrtchx

Re: language detection in tika ...

Posted by Ken Krugler <kk...@transpac.com>.

On Dec 13, 2013, at 8:04pm, Albretch Mueller <lb...@gmail.com> wrote:

> In section 7.2 (pg. 115) ... of "Tika in Action", the authors talk in
> very general terms about this topic and mention that Tika currently
> uses n-grams but may change the underlying algorithm in the future.
> 
> Could you/the committers/the authors share a little more about Tika's
> language detection internals and/or your probable future
> decisions/plans?

Currently it's based on some code that came over from Nutch, with a few improvements.

It has a number of issues, e.g. see…

https://issues.apache.org/jira/browse/TIKA-369

https://issues.apache.org/jira/browse/TIKA-856

https://issues.apache.org/jira/browse/TIKA-354

https://issues.apache.org/jira/browse/TIKA-568

https://issues.apache.org/jira/browse/TIKA-496

https://issues.apache.org/jira/browse/TIKA-993

https://issues.apache.org/jira/browse/TIKA-465

There's a proposal to replace this with language-detection, a separate library that has better accuracy and much faster performance. See…

https://issues.apache.org/jira/browse/TIKA-369

And yes, that's been sitting on my plate for way too long. If somebody wants to put a release stake in the ground, it would help motivate me to at least close out that issue :)

Regards,

-- Ken
