Posted to user@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2013/12/14 05:04:13 UTC
language detection in tika ...
In sections 7.2 (pg. 115) ... of "Tika in Action", the authors discuss
this topic in very general terms and mention that Tika currently
uses n-grams but may change the underlying algorithm in the future.
Could you, the committers, or the authors share a little more about Tika's
language detection internals and/or your likely future
decisions and plans?
Thanks,
lbrtchx
Re: language detection in tika ...
Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 14 Dec 2013, Albretch Mueller wrote:
> In sections 7.2 (pg. 115) ... of "Tika in Action", the authors discuss
> this topic in very general terms and mention that Tika currently uses
> n-grams but may change the underlying algorithm in the future.
I think it's based on tri-grams, with some code originally from Nutch, but
I'm not certain. There has certainly been talk of using some more recent
code, quite possibly with a wider range of gram sizes (is that the right
term?), but it's not an area of the codebase I'm all that strong on.
Nick
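For readers unfamiliar with the idea, here is a minimal sketch of what a trigram-based language profile can look like. It is not Tika's or Nutch's code: the class name TrigramSketch, the '_' word-boundary marker, and the choice of cosine similarity are all assumptions made for illustration. A real detector would build one profile per language from a training corpus and pick the language whose profile is closest to the input text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of Tika or Nutch.
public class TrigramSketch {

    // Count overlapping character trigrams in lower-cased text.
    // Every run of non-letters (including the padding spaces) collapses
    // into a single '_' marker so that word boundaries become part of the grams.
    public static Map<String, Integer> profile(String text) {
        String normalized = (" " + text + " ")
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", "_");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors, in the range 0.0 to 1.0.
    // Assumption: cosine similarity is used here only because it is easy to read.
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> english = profile("the quick brown fox jumps over the lazy dog");
        Map<String, Integer> sample = profile("the dog and the fox");
        System.out.println("similarity = " + similarity(english, sample));
    }
}
```

Widening the range of gram sizes, as mentioned above, would amount to running the same counting loop for several window lengths instead of only 3.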
Re: language detection in tika ...
Posted by Albretch Mueller <lb...@gmail.com>.
I meant to mention that my algorithm only works for alphabetic languages
(which are the ones that give a harder time anyway?). One issue
that I wonder about regarding Tika's LanguageIdentifier
tika.apache.org/1.2/api/org/apache/tika/language/LanguageIdentifier.html
is that you don't see an:
.isAlphabetic() {true, false}
test as part of the API.
lbrtchx
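As a rough sketch of what such a test could look like outside of Tika, the JDK's Character.UnicodeScript (Java 7 and later) can be used to check which writing system dominates a text. Everything here is hypothetical: the class ScriptSketch, the method isMostlyAlphabetic, the "more than half of the letters" threshold, and the short list of scripts treated as alphabetic are assumptions for illustration only. Note that the JDK's own Character.isAlphabetic(int) follows the Unicode Alphabetic property, which also covers Han ideographs, so it does not answer this particular question.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper; Tika's LanguageIdentifier has no such method.
public class ScriptSketch {

    // Assumption: an illustrative and deliberately incomplete set of alphabetic scripts.
    static final Set<Character.UnicodeScript> ALPHABETIC_SCRIPTS = EnumSet.of(
            Character.UnicodeScript.LATIN,
            Character.UnicodeScript.CYRILLIC,
            Character.UnicodeScript.GREEK,
            Character.UnicodeScript.ARMENIAN,
            Character.UnicodeScript.GEORGIAN);

    // Returns true when more than half of the letters in the text
    // belong to one of the scripts listed above.
    public static boolean isMostlyAlphabetic(String text) {
        int letters = 0;
        int alphabetic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, and whitespace
            }
            letters++;
            if (ALPHABETIC_SCRIPTS.contains(Character.UnicodeScript.of(cp))) {
                alphabetic++;
            }
        }
        return letters > 0 && alphabetic * 2 > letters;
    }

    public static void main(String[] args) {
        System.out.println(isMostlyAlphabetic("language detection"));
        // Japanese sample written with escapes to avoid source-encoding issues.
        System.out.println(isMostlyAlphabetic("\u8a00\u8a9e\u306e\u691c\u51fa"));
    }
}
```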
Re: language detection in tika ...
Posted by Albretch Mueller <lb...@gmail.com>.
Give me about two weeks and I may have some good ideas (based on
mathematics and pattern recognition), which I could even implement.
lbrtchx
Re: language detection in tika ...
Posted by Ken Krugler <kk...@transpac.com>.
On Dec 13, 2013, at 8:04pm, Albretch Mueller <lb...@gmail.com> wrote:
> In sections 7.2 (pg. 115) ... of "Tika in Action", the authors discuss
> this topic in very general terms and mention that Tika currently
> uses n-grams but may change the underlying algorithm in the future.
>
> Could you, the committers, or the authors share a little more about Tika's
> language detection internals and/or your likely future
> decisions and plans?
Currently it's based on some code that came over from Nutch, with a few improvements.
It has a number of issues, e.g. see…
https://issues.apache.org/jira/browse/TIKA-369
https://issues.apache.org/jira/browse/TIKA-856
https://issues.apache.org/jira/browse/TIKA-354
https://issues.apache.org/jira/browse/TIKA-568
https://issues.apache.org/jira/browse/TIKA-496
https://issues.apache.org/jira/browse/TIKA-993
https://issues.apache.org/jira/browse/TIKA-465
There's a proposal to replace this with language-detection, a separate library that has better accuracy and much faster performance. See…
https://issues.apache.org/jira/browse/TIKA-369
And yes, that's been sitting on my plate for way too long. If somebody wants to put a release stake in the ground, it would help motivate me to at least close out that issue :)
Regards,
-- Ken