Posted to java-user@lucene.apache.org by "Strittmatter Stephan (external)" <St...@kst.siemens.de> on 2001/11/23 14:56:09 UTC

Automatically determine language of document

Hi,

Has anyone done anything to autodetect the language of an
HTML document that will be indexed by Lucene?

I will use Lucene to index a multilingual portal
and want to filter the hits by language.

Thanks for any ideas,

Stephan

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: Automatically determine language of document

Posted by Elie Naulleau <se...@wanadoo.fr>.
Hi,
You could try Doug Beeferman's variable-length character n-gram approach,
which identifies a language among 13 European ones.
http://www.dougb.com/ident.html

I tried it and it works pretty well. It is based on a similarity measure
(cosine) between a corpus model and the input text.
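
To make the idea concrete, here is a minimal Java sketch of n-gram
profiles compared by cosine similarity. It is not Doug Beeferman's code:
it assumes fixed-length trigrams rather than variable-length n-grams, and
the class name NGramLanguageGuesser and its methods are made up for this
illustration. The per-language models would be built by calling profile()
on a training corpus for each language.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch only: character-trigram profiles compared by cosine similarity. */
final class NGramLanguageGuesser {
    // Assumption: fixed n-gram length of 3 (the approach cited above uses variable lengths).
    private static final int N = 3;

    /** Counts overlapping character n-grams in lower-cased, whitespace-normalized text. */
    static Map<String, Integer> profile(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= s.length(); i++) {
            counts.merge(s.substring(i, i + N), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine of the angle between two n-gram count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += v * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the model most similar to the text, or null if there are no models. */
    static String guess(String text, Map<String, Map<String, Integer>> models) {
        Map<String, Integer> input = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> m : models.entrySet()) {
            double score = cosine(input, m.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = m.getKey();
            }
        }
        return best;
    }
}
```

The returned language code could then be stored in its own field on each
indexed document so that hits can be filtered by it.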

There are issues depending on which character set you use (ISO Latin-1
or some other ASCII flavor).
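
As a small illustration of the kind of character set problem meant here
(my assumption about the typical case, not a description of the tool
above): the same bytes decode to different characters depending on the
charset, so the text should be decoded with the charset it was actually
written in before any n-gram or word comparison. The class name
CharsetAwareDecoding is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch only: decode raw document bytes with an explicit charset before detection. */
final class CharsetAwareDecoding {
    /** Decodes bytes using the charset the document declares (e.g. from an HTTP header or meta tag). */
    static String decode(byte[] raw, Charset declared) {
        return new String(raw, declared);
    }

    /** True if decoding produced U+FFFD, a hint that the wrong charset was assumed. */
    static boolean looksMisdecoded(String decoded) {
        return decoded.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in ISO Latin-1 but malformed as UTF-8.
        byte[] latin1Bytes = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(looksMisdecoded(decode(latin1Bytes, StandardCharsets.ISO_8859_1)));
        System.out.println(looksMisdecoded(decode(latin1Bytes, StandardCharsets.UTF_8)));
    }
}
```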

If you have just 4 or 5 languages to deal with, you can build your
own detector from lists of the most frequent words in each language.
I have some trivial C++ code that does this and can send it to you if
you need it. The identified language is chosen on a frequency criterion.
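
A rough Java sketch of the frequent-word idea follows. It is not the C++
code mentioned above; the class name StopwordLanguageGuesser and the tiny
word lists are placeholders I made up for illustration, and real lists
would need to be much longer. The language whose list matches the most
tokens wins.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch only: pick the language whose frequent-word list matches the most tokens. */
final class StopwordLanguageGuesser {
    // Assumption: tiny illustrative lists; a real detector needs far larger ones.
    private static final Map<String, Set<String>> COMMON_WORDS = new LinkedHashMap<>();
    static {
        COMMON_WORDS.put("en", new HashSet<>(Arrays.asList(
                "the", "and", "of", "to", "in", "is", "that", "it", "for", "with")));
        COMMON_WORDS.put("de", new HashSet<>(Arrays.asList(
                "der", "die", "und", "das", "ist", "nicht", "mit", "den", "von", "zu")));
        COMMON_WORDS.put("fr", new HashSet<>(Arrays.asList(
                "le", "la", "les", "et", "des", "est", "que", "dans", "pour", "un")));
    }

    /** Returns the best-matching language code, or null if no listed word occurs. */
    static String guess(String text) {
        // Split on anything that is not a Unicode letter, so accented words stay whole.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> lang : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (lang.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = lang.getKey();
            }
        }
        return best;
    }
}
```

Counting raw hits favors nothing in particular when the lists are the same
size; with lists of different lengths, dividing by list size or by token
count would be a fairer criterion.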

Of course, commercial products are available for that (try Xerox & inXight,
for instance; $$).

The point is how many languages you have to identify...

Addendum:
Some time ago, Bright Station (UK) had some open source C/C++ code
for a variety of stemmers for European languages (adapted from
the Porter stemmer approach).

I hope this helps,
Elie Naulleau
Semio-Sys

