Posted to java-user@lucene.apache.org by Ilya Zavorin <iz...@caci.com> on 2012/02/27 16:53:47 UTC

Can I detect incorrect language selection after creating an index?

Suppose I have a bunch of text documents in language X but I index them using an analyzer for language Y. Once the index is created, is it possible to perform some sort of simple "sanity" check to see whether the original language selection was wrong? I presume I could try searching for some common words in language Y, but I am not sure how reliable this would be. In particular, if the languages are from the same group, say X and Y are English and Spanish, I would expect this sanity check to produce false matches. However, I would be happy if it worked reliably enough for languages that use different scripts, e.g. Latin vs. Cyrillic vs. Arabic vs. Chinese.


Thanks much



Ilya Zavorin

Re: Can I detect incorrect language selection after creating an index?

Posted by Glen Newton <gl...@gmail.com>.
Do the check _before_ indexing.
Use https://code.google.com/p/language-detection/ to verify the
language of the text document before you put it in the index.
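
For the coarse case raised in the question (Latin vs. Cyrillic vs. Arabic vs. Chinese), a script-level check may already be enough. The following is only a minimal sketch using the JDK's Character.UnicodeScript, not the API of the library linked above; the class name ScriptCheck and its methods are made up for illustration. It cannot tell apart languages that share a script, such as English and Spanish, which is where a real language detector is needed.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene or the language-detection library).
public class ScriptCheck {

    /** Returns the most frequent Unicode script among the letters of text, or null if it contains no letters. */
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation and whitespace
            }
            counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
        }
        UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** True if the dominant script of text is the one the chosen analyzer expects. */
    static boolean matchesExpectedScript(String text, UnicodeScript expected) {
        return dominantScript(text) == expected;
    }

    public static void main(String[] args) {
        // Sample strings are written with Unicode escapes to avoid source-encoding issues.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442"; // a Russian word
        System.out.println(dominantScript("The quick brown fox"));
        System.out.println(dominantScript(russian));
        // A document about to go through an English analyzer should be mostly Latin.
        System.out.println(matchesExpectedScript(russian, UnicodeScript.LATIN));
    }
}
```

The same function could in principle be applied to a sample of terms read back from an existing index, although analyzers may have lowercased, stemmed or dropped terms by then; how to enumerate terms depends on the Lucene version and is not shown here.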

-Glen Newton
http://zzzoot.blogspot.com/



