Posted to user@uima.apache.org by Tommaso Teofili <to...@gmail.com> on 2008/12/08 10:23:15 UTC

Language recognition

Hello,
I am writing an AE pipeline and I need to detect the language in which the
input document is written.
My idea is to use the Whitespace Tokenizer and the HMM Tagger together to
analyze the extracted tokens, calculate the percentage of known tokens for
each language (checked against a dictionary), and then select the language
with the highest percentage.
Do you know of other (better) language recognition methods?
Thanks.
Tommaso
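
For illustration, the dictionary idea described above could be sketched
roughly as follows. This is only a minimal sketch in Python, not UIMA code:
the regex tokenizer stands in for the Whitespace Tokenizer, the HMM Tagger
is left out, and the tiny word lists and all function names are hypothetical
placeholders for real dictionaries.

```python
import re

# Hypothetical toy word lists; a real system would load full dictionaries.
DICTIONARIES = {
    "en": {"the", "is", "and", "of", "in", "a", "to", "this", "document", "written"},
    "it": {"il", "la", "è", "e", "di", "in", "un", "questo", "documento", "scritto"},
    "de": {"der", "die", "das", "ist", "und", "in", "ein", "dieses", "dokument", "geschrieben"},
}

def tokenize(text):
    """Lowercase the text and split it into word tokens (stand-in tokenizer)."""
    return re.findall(r"\w+", text.lower())

def coverage_by_language(text, dictionaries=DICTIONARIES):
    """Return, per language, the fraction of tokens found in its dictionary."""
    tokens = tokenize(text)
    if not tokens:
        return {lang: 0.0 for lang in dictionaries}
    return {
        lang: sum(t in words for t in tokens) / len(tokens)
        for lang, words in dictionaries.items()
    }

def guess_language(text, dictionaries=DICTIONARIES):
    """Pick the language with the highest coverage, or None if nothing matched."""
    scores = coverage_by_language(text, dictionaries)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(coverage_by_language("This document is written in English"))
print(guess_language("Questo documento è scritto in italiano"))
```

Note that words shared between languages (such as "in" here) count for every
language that lists them, so short texts can produce close scores.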

Re: Language recognition

Posted by Hannes Carl Meyer <ha...@googlemail.com>.

Hi Tommaso,

One common method for language recognition is based on n-grams.
There are also some Java implementations out there, for example NGramJ:
http://ngramj.sourceforge.net/
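
To give a rough idea of what an n-gram based approach looks like, here is a
minimal sketch in Python. It is not the NGramJ or Nutch API; it compares
ranked character n-gram profiles using an "out-of-place" distance, which is
one common way to do this. The function names, the parameters n_max and
top_k, and the one-sentence training samples are all assumptions made for
the example; real profiles would be built from much larger corpora.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Return the most frequent character n-grams (1..n_max), most frequent first."""
    # Replace whitespace with "_" so word boundaries become part of the n-grams.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[g]) if g in rank else max_penalty
        for i, g in enumerate(doc_profile)
    )

def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Hypothetical, far too small training samples, for demonstration only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt über den faulen hund",
}
profiles = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
print(identify("the dog is lazy", profiles))
```

Unlike a dictionary lookup, this needs no tokenizer or word list per
language, only a sample text from which to build each profile.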

Nutch (the crawler from Lucene) also uses the n-gram approach. You can find
some information about it here: http://wiki.apache.org/nutch/LanguageIdentifier
and here: http://wiki.apache.org/nutch/LanguageIdentifierPlugin

I wouldn't suggest reinventing the wheel unless it is a bigger, faster one!

Regards

Hannes
---
http://mimblog.de

Re: Language recognition

Posted by Niels Ott <no...@sfs.uni-tuebingen.de>.

Torsten Zesch wrote:
> you could use TextCat
> http://odur.let.rug.nl/~vannoord/TextCat/

This works quite well, but it is a bit slow.

If you simply want to know whether a document is written in a given 
language or not, the laziest way is to use a spell checker and compute 
the percentage of "correctly spelled" words.
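
A minimal sketch of that check in Python, assuming the spell checker is
available as some callable that says whether a single word is spelled
correctly. The name looks_like_language, the is_correct parameter, and the
0.5 threshold are hypothetical; a plain word set stands in for a real spell
checker here.

```python
def looks_like_language(text, is_correct, threshold=0.5):
    """Return True if at least `threshold` of the words pass the spell check."""
    # Strip common punctuation so "mat." is checked as "mat".
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    correct = sum(1 for w in words if is_correct(w))
    return correct / len(words) >= threshold

# Stand-in for a real spell checker: membership in a small English word set.
english_words = {"the", "cat", "sat", "on", "mat"}
print(looks_like_language("The cat sat on the mat.", english_words.__contains__))
```

The threshold would need tuning on real documents, since names and technical
terms are often unknown to a spell checker.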

Best,

    Niels

-- 
Niels Ott
Computational Linguist (B.A.)
http://www.drni.de/niels/

RE: Language recognition

Posted by Torsten Zesch <ze...@tk.informatik.tu-darmstadt.de>.

Hi Tommaso,

You could use TextCat
http://odur.let.rug.nl/~vannoord/TextCat/

or one of its competitors:
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

-Torsten 
