
Posted to solr-user@lucene.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/07/01 12:33:24 UTC

RE: Automatic Language Identification

+1 to langdetect

In Tika 2.0, we're going to remove our own language detection code and allow users to select Optimaize (fork of langdetect), MIT Lincoln Lab’s Text.jl library or Yalder (https://github.com/kkrugler/yalder).  The first two are now available in Tika 1.13.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Wednesday, June 22, 2016 8:27 AM
To: solr-user@lucene.apache.org; solr-user <so...@lucene.apache.org>
Subject: RE: Automatic Language Identification

Hello,

I recommend using the langdetect language detector; it supports many more languages and has much higher precision than Tika's detector.

Markus

Re: Automatic Language Identification

Posted by William Bell <bi...@gmail.com>.

We should add a simple filter in Solr for this. The current approach
requires indexing.

https://github.com/kkrugler/yalder is good and would make a great filter:

if the text is NOT English, fail the whole text.
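
As a rough illustration of the "fail the whole text" idea, here is a minimal
Python sketch. It is not Solr or Yalder code: english_only and toy_detect are
hypothetical names, and toy_detect is only a stand-in so the example runs on
its own. A real detector would be passed in its place, assuming it returns a
language code such as "en".

```python
from typing import Callable, Optional


def english_only(text: str, detect: Callable[[str], str]) -> Optional[str]:
    """Return the text unchanged if it is detected as English, else None.

    `detect` is any callable mapping a string to a language code.
    Returning None stands for "fail the whole text".
    """
    if not text.strip():
        # Nothing to detect on; treat empty input as a failure.
        return None
    return text if detect(text) == "en" else None


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector: checks for a few English stop words.

    This is NOT a real language identifier; it only keeps the sketch
    self-contained.
    """
    english_hints = {"the", "and", "is", "of"}
    words = set(text.lower().split())
    return "en" if words & english_hints else "unknown"


if __name__ == "__main__":
    for sample in ["the cat is here", "le chat est ici"]:
        result = english_only(sample, toy_detect)
        print(sample, "->", "kept" if result is not None else "rejected")
```

Keeping the detector as a parameter means the accept/reject rule stays the
same no matter which detection library ends up behind it.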



-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076