Posted to dev@lucene.apache.org by Karsten Konrad <Ka...@xtramind.com> on 2004/02/03 11:39:40 UTC

AW: N-gram layer and language guessing

Hi,

does anybody here use an n-gram layer for fault-tolerant searching 
on *larger* texts? I ask because you can expect to see far more 
n-grams than words emerging from a text once you use at least
quad-grams - and the number of distinct tokens indexed seems to 
be the most important parameter for Lucene's search speed.
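
For a sense of scale, a quick sketch in plain Java (not Lucene's
analyzer API) of why quad-grams inflate the distinct-token count so
fast:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class QuadGramDemo {

        // Collect all distinct character n-grams of the given size.
        static Set<String> ngrams(String text, int n) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
            return grams;
        }

        public static void main(String[] args) {
            String text = "jag heter kalle";
            Set<String> words = new HashSet<>(Arrays.asList(text.split("\\s+")));
            System.out.println("distinct words:      " + words.size());           // 3
            System.out.println("distinct quad-grams: " + ngrams(text, 4).size()); // 12
        }
    }

Even a three-word query produces four times as many distinct
quad-grams as words, and on larger texts the gap keeps widening.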

Anyway, XtraMind's n-gram language guesser gives the following 
best five results on the Swedish examples discussed previously:

"jag heter kalle"

swedish 100,00 %
norwegian 17,51 %
danish 10,02 %
afrikaans 9,53 %
dutch 8,79 %

"vad heter du"

swedish 100,00 %
dutch 20,97 %
norwegian 14,68 %
danish 11,07 %
afrikaans 9,29 %

The guesser uses only tri- and quad-grams and is based on
a sophisticated machine learning algorithm instead of a raw
TF/IDF-weighting. The upside of this is the "confidence" 
value for estimating how much you can trust the 
classification. The downside is the model size: 5MB for 15 
languages, which comes mostly from using quad-grams - our 
machine learners don't do feature selection very well.
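
How the confidence comes out is not spelled out above; with the winner
pinned at 100,00 %, the percentages look like raw scores normalized
against the best one. A guess at that last step only - the classifier
itself stays a black box:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class Confidence {

        // Rescale raw per-language scores so the winner reads 100 %
        // and runners-up are reported relative to it.
        static Map<String, Double> toPercent(Map<String, Double> raw) {
            double best = 0.0;
            for (double s : raw.values()) best = Math.max(best, s);
            Map<String, Double> out = new LinkedHashMap<>();
            for (Map.Entry<String, Double> e : raw.entrySet()) {
                out.put(e.getKey(), 100.0 * e.getValue() / best);
            }
            return out;
        }
    }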

Best regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
konrad@xtramind.com
www.xtramind.com



-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Tuesday, 3 February 2004 09:27
To: Lucene Developers List
Subject: Re: N-gram layer


karl wettin wrote:
> On Mon, 2 Feb 2004 20:10:57 +0100
> "Jean-Francois Halleux" <ha...@skynet.be> wrote:
> 
> 
>>during the past days, I've developed such a language guesser myself 
>>as a basis for a Lucene analyzer. It is based on trigrams. It is 
>>already working but not yet in a "publishable" state. If you or others 
>>are interested I can offer the sources.
> 
> 
> I use variable gram sizes because of how tough it is to detect the 
> language of very small texts such as a query. For instance, with 
> bi->quad-grams the Swedish sentence "Jag heter Karl" (my name is 
> Karl) is presumed to be in Afrikaans. Using uni->quad-grams does the 
> trick.
> 
> Also, I add penalties for gram-sized words found in the text but not 
> in the classified language. This improved my results even more.
> 
> And I've been considering applying Markov chains on the grams where 
> it is still hard to guess the language, such as Afrikaans vs. Dutch 
> and American vs. British English.
> 
> Let me know if you want a copy of my code.
> 
> 
> Here is some test output:
> 
[...]
> As you see, the single-word penalty on uni->quad-grams does the trick 
> on even the smallest of text strings.
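
For concreteness, a minimal sketch of the scheme karl describes above -
uni->quad-gram extraction plus a penalty for grams a language's profile
has never seen. The profile representation and the flat penalty are
assumptions, not his actual code:

    import java.util.Map;

    public class VariableGramScorer {

        // Score a text against one language profile: sum the profile
        // frequencies of all 1- to 4-grams, and subtract a penalty for
        // every gram the profile has never seen.
        static double score(String text, Map<String, Double> profile, double penalty) {
            double total = 0.0;
            for (int n = 1; n <= 4; n++) {
                for (int i = 0; i + n <= text.length(); i++) {
                    Double freq = profile.get(text.substring(i, i + n));
                    total += (freq != null) ? freq : -penalty;
                }
            }
            return total;
        }
    }

The best-scoring profile wins; the penalty term is what separates close
neighbours like Swedish and Afrikaans on very short strings.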

Well, perhaps it's also a matter of the quality of the language 
profiles. In one of my projects I'm using language profiles constructed 
from 1- to 5-grams, with a total of 300 grams per language profile. I 
don't do any additional tricks with penalizing the high-frequency words.
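
Profiles of this shape - the top 300 grams per language, compared by
rank - are commonly built along the lines of the Cavnar & Trenkle
out-of-place measure. Andrzej doesn't say this is his method, so take
the following as an illustrative sketch only:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class NGramProfile {

        // A profile is the 300 most frequent 1- to 5-grams of the
        // training text, in descending frequency order.
        static List<String> build(String trainingText) {
            Map<String, Integer> counts = new HashMap<>();
            for (int n = 1; n <= 5; n++) {
                for (int i = 0; i + n <= trainingText.length(); i++) {
                    counts.merge(trainingText.substring(i, i + n), 1, Integer::sum);
                }
            }
            return counts.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(300)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        // Out-of-place distance: for each gram of the document profile,
        // add how far its rank is from its rank in the language profile
        // (or a maximum penalty if it is missing).
        static int distance(List<String> doc, List<String> lang) {
            int d = 0;
            for (int i = 0; i < doc.size(); i++) {
                int j = lang.indexOf(doc.get(i));
                d += (j < 0) ? lang.size() : Math.abs(i - j);
            }
            return d;
        }
    }

Lower distance means a better match, which is consistent with the
scores below, where SV wins with the smallest value.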

If I run the above example, I get the following:

  "jag heter kalle"
<input> - SV:   0.7197875
<input> - DN:   0.745925
<input> - NO:   0.747225
<input> - FI:   0.755475
<input> - NL:   0.7597125
<input> - EN:   0.76746875
<input> - FR:   0.77628125
<input> - GE:   0.7785125
<input> - IT:   0.796675
<input> - PL:   0.7984875
<input> - PT:   0.7995875
<input> - ES:   0.800775
<input> - RU:   0.88500625

However, for the text "vad heter du" (what's your name) the detection 
fails... :-)

A question: what was your source for the representative high-frequency 
words in the various languages? Was it your training corpus or some 
publication?

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: AW: N-gram layer and language guessing

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Tue, 03 Feb 2004 12:47:06 +0100
Andrzej Bialecki <ab...@getopt.org> wrote:

> Karsten Konrad wrote:
> > The guesser uses only tri- and quad-grams and is based on
> > a sophisticated machine learning algorithm instead of a raw
> > TF/IDF-weighting. The upside of this is the "confidence" 
> > value for estimating how much you can trust the 
> > classification. The downside is the model size: 5MB for 15 
> > languages, which comes mostly from using quad-grams - our 
> > machine learners don't do feature selection very well.
> 
> Impressive. For comparison, my language models are roughly 3kB per 
> language, and the guesser works with nearly perfect accuracy for 
> texts longer than 10 words. Below that - it depends.. :-)

Impressive indeed. However, it is quite important that one can detect
the language of a query: a query is very rarely 10 words long. And it 
is the query whose language I want to detect when stemming.
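
The pipeline in mind, roughly: guess the language of the query first,
then stem with the matching stemmer. A hypothetical sketch -
LanguageGuesser and Stemmer here are stand-ins, not existing Lucene
interfaces:

    import java.util.Map;

    public class QueryStemming {

        interface LanguageGuesser { String guess(String text); }
        interface Stemmer { String stem(String word); }

        // Stem the query terms with the stemmer of the guessed language;
        // leave the terms untouched when no stemmer is available.
        static String[] stemQuery(String query, LanguageGuesser guesser,
                                  Map<String, Stemmer> stemmers) {
            String[] terms = query.toLowerCase().split("\\s+");
            Stemmer s = stemmers.get(guesser.guess(query)); // e.g. "sv"
            if (s == null) return terms;
            for (int i = 0; i < terms.length; i++) {
                terms[i] = s.stem(terms[i]);
            }
            return terms;
        }
    }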

Karsten, what specifics can you tell us about the algorithms? 

I'm going to take a look at Weka tonight and see whether I could
implement something like this for Lucene.



kalle



Re: AW: N-gram layer and language guessing

Posted by Andrzej Bialecki <ab...@getopt.org>.
Karsten Konrad wrote:
> Hi,
> 
> does anybody here use an n-gram layer for fault-tolerant searching 
> on *larger* texts? I ask because you can expect to see far more 
> n-grams than words emerging from a text once you use at least
> quad-grams - and the number of distinct tokens indexed seems to 
> be the most important parameter for Lucene's search speed.
> 
> Anyway, XtraMind's n-gram language guesser gives the following 
> best five results on the Swedish examples discussed previously:
> 
> "jag heter kalle"
> 
> swedish 100,00 %
> norwegian 17,51 %
> danish 10,02 %
> afrikaans 9,53 %
> dutch 8,79 %
> 
> "vad heter du"
> 
> swedish 100,00 %
> dutch 20,97 %
> norwegian 14,68 %
> danish 11,07 %
> afrikaans 9,29 %
> 
> The guesser uses only tri- and quad-grams and is based on
> a sophisticated machine learning algorithm instead of a raw
> TF/IDF-weighting. The upside of this is the "confidence" 
> value for estimating how much you can trust the 
> classification. The downside is the model size: 5MB for 15 
> languages, which comes mostly from using quad-grams - our 
> machine learners don't do feature selection very well.

Impressive. For comparison, my language models are roughly 3kB per 
language, and the guesser works with nearly perfect accuracy for texts 
longer than 10 words. Below that - it depends.. :-)

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: AW: N-gram layer and language guessing

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Tue, 3 Feb 2004 11:39:40 +0100
"Karsten Konrad" <Ka...@xtramind.com> wrote:

> 
> Anyway, XtraMind's n-gram language guesser gives the following 
> best five results on the Swedish examples discussed previously:
> 
> "jag heter kalle"
> 
> swedish 100,00 %
> norwegian 17,51 %
> danish 10,02 %
> afrikaans 9,53 %
> dutch 8,79 %
> 
> "vad heter du"
> 
> swedish 100,00 %
> dutch 20,97 %
> norwegian 14,68 %
> danish 11,07 %
> afrikaans 9,29 %


I spent all my time working on a better language guesser rather than
building the stemmer. The results I got from Weka are OK, but due to
the amount of computation needed to guess the language of even the
shortest of strings, it is not possible for me to use these algorithms.

Instead I'll do some experiments with Markov chains on the n-grams.
Hopefully this will yield quite a distinct difference between languages
without wasting too many clock ticks.
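
A character-level Markov chain for this can be as simple as
per-language transition log-probabilities summed over the string. A
sketch of the general technique, not karl's eventual code, with only a
crude floor instead of proper smoothing:

    import java.util.HashMap;
    import java.util.Map;

    public class MarkovGuesser {

        // Train a character-bigram model: log P(next | prev),
        // estimated by counting adjacent character pairs.
        static Map<String, Double> train(String corpus) {
            Map<String, Integer> pairs = new HashMap<>();
            Map<Character, Integer> prevs = new HashMap<>();
            for (int i = 0; i + 1 < corpus.length(); i++) {
                pairs.merge(corpus.substring(i, i + 2), 1, Integer::sum);
                prevs.merge(corpus.charAt(i), 1, Integer::sum);
            }
            Map<String, Double> logProb = new HashMap<>();
            pairs.forEach((bigram, c) ->
                logProb.put(bigram, Math.log((double) c / prevs.get(bigram.charAt(0)))));
            return logProb;
        }

        // Log-likelihood of a text under one language model; unseen
        // transitions get a fixed low probability.
        static double score(String text, Map<String, Double> logProb) {
            double ll = 0.0;
            for (int i = 0; i + 1 < text.length(); i++) {
                ll += logProb.getOrDefault(text.substring(i, i + 2), Math.log(1e-6));
            }
            return ll;
        }
    }

Scoring is one hash lookup per character, so even a three-word query is
effectively free - which is the attraction over the heavier Weka
classifiers.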

Any thoughts on the subject are welcome.

I'll get back with results.

-- 

kalle

