Posted to dev@lucene.apache.org by karl wettin <ka...@snigel.dnsalias.net> on 2004/02/01 22:07:13 UTC

N-gram layer

Hello list,

I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some things missing that I'd be happy to
contribute.

I started by writing a simple N-gram classifier to detect the language of
a text in order to automatically cluster documents by language. The
algorithm is very similar to the "TextCat" C library.
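For readers unfamiliar with the TextCat approach, the profile-building step can be sketched roughly like this (plain Java; class and method names are illustrative only, not TextCat or Lucene code): count every character n-gram of a text and rank the grams by frequency; classification then compares the ranked profile of an input against precomputed per-language profiles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of TextCat-style profile building.
public class NGramProfile {

    // Rank all character n-grams (sizes minN..maxN) of the text,
    // most frequent first. Whitespace is normalized to "_" padding.
    public static List<String> profile(String text, int minN, int maxN) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                String gram = padded.substring(i, i + n);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
        }
        List<String> ranked = new ArrayList<String>(counts.keySet());
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                int d = counts.get(b) - counts.get(a); // higher count first
                return d != 0 ? d : a.compareTo(b);    // ties: alphabetical
            }
        });
        return ranked;
    }
}
```

A language model is just such a ranked list built from a training corpus; an unknown text is assigned to the language whose profile its own profile is closest to.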

And then I thought: maybe it would be possible to use the same N-gram
classifier to make an automatic stemmer that works on all languages.
Hopefully I'll have something up and running for tests by next weekend.

The same classifier could be used for a simple metaphone index.

However, I need some help understanding the Analyzer. Where can I
find some tutorials on how to write my own? (I didn't check with Google;
maybe I should have before posting here.) Since the stemmer (and metaphone)
data would have to be indexed in their own field(s), querying the stemmed
data would require one to stem the query too. Can I create a subclass of
Query (or so), or do I need to create my own Query class that handles
the stemming all the way for the user? The last option is my current
approach, so I would appreciate some hints and pointers here.


Great project! 


karl



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: N-gram layer

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Sun, 1 Feb 2004 22:15:26 -0600
"Robert Engels" <re...@ix.netcom.com> wrote:

> Actually, you do not always need to store it in a field.
> 
> See the Phonetic Query patch I posted (which does Soundex, Metaphone,
> and can actually do any 'secondary' info query).

Now it hit me: I really don't need to store the stemmed document at all;
it would save quite a bit of disk space to stem the indexed data in real time.

Silly me.


Is it an Analyzer or Query I want to subclass?


-- 
karl

http://sf.net/projects/silvertejp/ 

[abstract Human]<|--+--[Woman]<>-- +mother +child {0..*} --[Human]
                    \--[Man]<>-- +father +child {0..*} --[Human]

"arghhh .. it's all in geek" - objectmonkey.com 



Re: N-gram layer

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Tue, 03 Feb 2004 09:27:25 +0100
Andrzej Bialecki <ab...@getopt.org> wrote:

> 
> A question: what was your source for the representative hi-frequency 
> words in various languages? Was it your training corpus or some publication?

I use the data supplied with Gertjan van Noord's TextCat distribution.

http://odur.let.rug.nl/~vannoord/TextCat/


-- 

karl



Re: N-gram layer

Posted by Andrzej Bialecki <ab...@getopt.org>.
karl wettin wrote:

> On Tue, 03 Feb 2004 09:27:25 +0100
> Andrzej Bialecki <ab...@getopt.org> wrote:
> 
> 
>>If I run the above example, I get the following:
>>
>>  "jag heter kalle"
>><input> - SV:   0.7197875
> 
> 
> What does a score of 1.0 mean?

1.0 - completely dissimilar language profiles
0.0 - completely similar language profiles

However, it is not a pure cosine measure of two vectors (input text and 
language profile) in n-gram space. I had to do some tricky tuning, too...

Getting good results for such short texts using just statistical 
analysis is largely guessing, heuristics, a bit of cheating, and a good 
portion of pure luck... IOW, just magic. :-)
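[As a hedged illustration of how such a [0,1] score can arise (this is not Andrzej's actual implementation): take the classic out-of-place rank distance between the input's n-gram profile and a language profile, and normalize it by the worst possible total, so identical profiles score 0.0 and profiles sharing no grams score 1.0.

```java
import java.util.List;

// Illustrative sketch only: normalized out-of-place rank distance
// between two ranked n-gram profiles. 0.0 = identical, 1.0 = disjoint.
public class ProfileDistance {

    public static double distance(List<String> input, List<String> lang) {
        int max = lang.size(); // cost for a gram missing from the profile
        long total = 0;
        for (int i = 0; i < input.size(); i++) {
            int j = lang.indexOf(input.get(i));
            total += (j < 0) ? max : Math.abs(i - j);
        }
        return input.isEmpty() ? 0.0
                : (double) total / ((double) input.size() * max);
    }
}
```

The "tricky tuning" mentioned above would then live in how the raw distances are weighted and combined — Ed.]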


-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: N-gram layer

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Tue, 03 Feb 2004 09:27:25 +0100
Andrzej Bialecki <ab...@getopt.org> wrote:

> 
> If I run the above example, I get the following:
> 
>   "jag heter kalle"
> <input> - SV:   0.7197875

What does a score of 1.0 mean?


-- 

karl



Re: Transaction support

Posted by Oliver Zeigermann <oz...@apache.org>.
I have implemented some sort of transactional file system for the
Jakarta Slide project. It should be fairly easy to adapt Lucene to it, I
guess. However, since it occasionally copies files, you should keep
index files that get updated rather small.
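[To make concrete why copying argues for small files, here is a toy, self-contained sketch of a copy-based transaction on a single file (illustrative only, not Slide's actual code): begin() snapshots the file, rollback() restores the snapshot, commit() discards it — so every transaction pays the full cost of copying the file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Toy copy-based transaction over one file (hypothetical class).
public class CopyTransaction {
    private final File file;
    private final File backup;

    public CopyTransaction(File file) {
        this.file = file;
        this.backup = new File(file.getPath() + ".bak");
    }

    // Snapshot the current contents before any modification.
    public void begin() throws IOException { copy(file, backup); }

    // Keep the modified file; the snapshot is no longer needed.
    public void commit() { backup.delete(); }

    // Restore the snapshot, discarding all modifications.
    public void rollback() throws IOException {
        copy(backup, file);
        backup.delete();
    }

    private static void copy(File from, File to) throws IOException {
        FileInputStream in = new FileInputStream(from);
        FileOutputStream out = new FileOutputStream(to);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
        in.close();
        out.close();
    }
}
```

With a multi-megabyte index segment, begin() alone becomes expensive, which is exactly the caveat above — Ed.]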

Oliver

Vladimir Lukin wrote:

> Has anybody ever thought of adding transaction support to Lucene?
> I mean that it would be pretty fun to add rollback and commit methods to
> IndexWriter so that one could roll back any changes if an error is
> encountered, or commit them without closing the writer. Maybe someone
> has already implemented something like this?
> 
> Vladimir.
> 
> 





Re: Transaction support

Posted by Andi Vajda <an...@osafoundation.org>.
This can be done by implementing an org.apache.lucene.store.Directory that
supports transactions. It was last done with a Berkeley DB based
implementation, part of the sandbox, I believe.

Andi..

On Wed, 4 Feb 2004, Vladimir Lukin wrote:

>
> Has anybody ever thought of adding transaction support to Lucene?
> I mean that it would be pretty fun to add rollback and commit methods to
> IndexWriter so that one could roll back any changes if an error is
> encountered, or commit them without closing the writer. Maybe someone
> has already implemented something like this?
>
> Vladimir.
>
>



Re: Transaction support

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I believe somebody has already implemented this and contributed it
some two months ago.
Unfortunately, nobody has reviewed or committed the contribution to
Lucene's CVS yet.

Otis

--- Vladimir Lukin <vl...@yandex.ru> wrote:
> 
> Has anybody ever thought of adding transaction support to Lucene?
> I mean that it would be pretty fun to add rollback and commit methods to
> IndexWriter so that one could roll back any changes if an error is
> encountered, or commit them without closing the writer. Maybe someone
> has already implemented something like this?
> 
> Vladimir.
> 
> 




Transaction support

Posted by Vladimir Lukin <vl...@yandex.ru>.
Has anybody ever thought of adding transaction support to Lucene?
I mean that it would be pretty fun to add rollback and commit methods to
IndexWriter so that one could roll back any changes if an error is
encountered, or commit them without closing the writer. Maybe someone
has already implemented something like this?

Vladimir.




Re: N-gram layer

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Tuesday 03 February 2004 02:18, karl wettin wrote:
> On Tue, 3 Feb 2004 09:54:19 +0100
>
> karl wettin <ka...@snigel.dnsalias.net> wrote:
> > test has a weight of 1731 in Swedish
> > test has a weight of 1726 in Danish
>
> Oh dear. Mine fails too.

Considering that Swedish, Danish and Norwegian are very similar to each
other, it's probably one of the tougher cases to distinguish? And even more
so for an example like "jag heter Kalle", where one word is a proper noun,
not a language word. I guess what I'm saying is that, these being
heuristics, it's less dangerous to mix up similar languages than more
distant ones.

-+ Tatu +-




Re: N-gram layer

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Tue, 3 Feb 2004 09:54:19 +0100
karl wettin <ka...@snigel.dnsalias.net> wrote:

> 
> 
> test has a weight of 1731 in Swedish
> test has a weight of 1726 in Danish

Oh dear. Mine fails too.


-- 

karl



Re: N-gram layer

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Tue, 03 Feb 2004 09:27:25 +0100
Andrzej Bialecki <ab...@getopt.org> wrote:

> 
> However, for the text "vad heter du" (what's your name) the detection 
> fails... :-)

I'm sorry for my multiple replies..

1->5 grams and penalty:

vad heter du

test has a weight of 1731 in Swedish
test has a weight of 1726 in Danish
test has a weight of 1789 in Norwegian
test has a weight of 2037 in Afrikaans
test has a weight of 2274 in Dutch


I recommend you add penalties. I add (max distance * 4).
There is no science whatsoever behind that number.
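[The penalty scheme described above can be sketched like this (illustrative code, not the actual implementation): a gram absent from the language profile costs four times the maximum rank distance instead of just the maximum.

```java
import java.util.List;

// Sketch of the "max distance * 4" penalty from the mail above.
public class PenaltyDistance {

    public static long distance(List<String> input, List<String> lang) {
        int maxDistance = lang.size();
        long total = 0;
        for (int i = 0; i < input.size(); i++) {
            int j = lang.indexOf(input.get(i));
            // Grams the language profile has never seen are penalized
            // four times harder than the worst in-profile mismatch.
            total += (j < 0) ? maxDistance * 4L : Math.abs(i - j);
        }
        return total;
    }
}
```

The factor 4 is the ad-hoc value from the mail; any multiplier > 1 expresses the same idea — Ed.]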



karl



Re: N-gram layer

Posted by Andrzej Bialecki <ab...@getopt.org>.
karl wettin wrote:
> On Mon, 2 Feb 2004 20:10:57 +0100
> "Jean-Francois Halleux" <ha...@skynet.be> wrote:
> 
> 
>>during the past days, I've developed such a language guesser myself
>>as a basis for a Lucene analyzer. It is based on trigrams. It is
>>already working but not yet in a "publishable" state. If you or others
>>are interested I can offer the sources.
> 
> 
> I use variable gram size due to the toughness of detecting the language of
> very small texts such as a query. For instance, applying bi->quadgrams to
> the Swedish sentence "Jag heter Karl" (my name is Karl) classifies it as
> Afrikaans. Using uni->quadgrams does the trick.
> 
> Also, I add penalties for gram-sized words found in the text but not in
> the classified language. This improved my results even more.
> 
> And I've been considering applying Markov chains on the grams where it
> still is hard to guess the language, such as Afrikaans vs. Dutch and
> American vs. British English.
> 
> Let me know if you want a copy of my code.
> 
> 
> Here is some test output:
> 
[...]
> As you see, the single word penalty on uni->quad does the trick on even the
> smallest of text strings.

Well, perhaps it's also a matter of the quality of the language
profiles. In one of my projects I'm using language profiles constructed
from 1-5 grams, with a total of 300 grams per language profile. I don't
do any additional tricks like penalizing high-frequency words.

If I run the above example, I get the following:

  "jag heter kalle"
<input> - SV:   0.7197875
<input> - DN:   0.745925
<input> - NO:   0.747225
<input> - FI:   0.755475
<input> - NL:   0.7597125
<input> - EN:   0.76746875
<input> - FR:   0.77628125
<input> - GE:   0.7785125
<input> - IT:   0.796675
<input> - PL:   0.7984875
<input> - PT:   0.7995875
<input> - ES:   0.800775
<input> - RU:   0.88500625

However, for the text "vad heter du" (what's your name) the detection 
fails... :-)

A question: what was your source for the representative hi-frequency 
words in various languages? Was it your training corpus or some publication?

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: N-gram layer

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Mon, 2 Feb 2004 20:10:57 +0100
"Jean-Francois Halleux" <ha...@skynet.be> wrote:

> during the past days, I've developed such a language guesser myself
> as a basis for a Lucene analyzer. It is based on trigrams. It is
> already working but not yet in a "publishable" state. If you or others
> are interested I can offer the sources.

I use variable gram size due to the toughness of detecting the language of
very small texts such as a query. For instance, applying bi->quadgrams to
the Swedish sentence "Jag heter Karl" (my name is Karl) classifies it as
Afrikaans. Using uni->quadgrams does the trick.
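[The reason small grams help short inputs is simply evidence volume: a three-word query yields only a handful of quadgrams but many unigrams and bigrams. A quick hypothetical helper makes the point:

```java
// Illustrative helper: how many n-grams a (padded) text yields at size n.
public class GramCount {

    public static int count(String text, int n) {
        // Same "_" padding convention as a TextCat-style profiler.
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        return Math.max(0, padded.length() - n + 1);
    }
}
```

For "jag heter kalle", going from quadgrams down to unigrams roughly doubles the available evidence, which is why the uni->quad range is more robust on queries — Ed.]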

Also, I add penalties for gram-sized words found in the text but not in
the classified language. This improved my results even more.

And I've been considering applying Markov chains on the grams where it
still is hard to guess the language, such as Afrikaans vs. Dutch and
American vs. British English.

Let me know if you want a copy of my code. 


Here is some test output:

test = "jag heter kalle." 

WITH SINGLE WORD PENALTIES:

uni->quad-gram

test has a weight of 1600 in Swedish
test has a weight of 1848 in Afrikaans
test has a weight of 1928 in Dutch
test has a weight of 2021 in Danish
test has a weight of 2011 in Norwegian

bi->quad-gram

test has a weight of 1024 in Swedish
test has a weight of 1199 in Afrikaans
test has a weight of 1356 in Dutch
test has a weight of 1376 in Danish
test has a weight of 1434 in Norwegian

tri-gram only

test has a weight of 190 in Norwegian
test has a weight of 212 in Afrikaans
test has a weight of 221 in Swedish
test has a weight of 236 in Danish
test has a weight of 237 in Dutch


WITHOUT SINGLE WORD PENALTY:

uni->quad-gram

test has a weight of 1448 in Afrikaans
test has a weight of 1528 in Dutch
test has a weight of 1600 in Swedish
test has a weight of 1611 in Norwegian
test has a weight of 1621 in Danish

bi->quad-gram

test has a weight of 799 in Afrikaans
test has a weight of 956 in Dutch
test has a weight of 976 in Danish
test has a weight of 1024 in Swedish
test has a weight of 1034 in Norwegian

tri-gram only

test has a weight of 190 in Norwegian
test has a weight of 212 in Afrikaans
test has a weight of 221 in Swedish
test has a weight of 236 in Danish
test has a weight of 237 in Dutch


As you see, the single word penalty on uni->quad does the trick on even the
smallest of text strings.



karl





Re: N-gram layer

Posted by Andrzej Bialecki <ab...@getopt.org>.
karl wettin wrote:
> On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
> Otis Gospodnetic <ot...@yahoo.com> wrote:
> 
> 
>>Looking forward to the contribution.
> 
> 
> Sorry for the delay, but I've had quite some workload lately, and then I
> moved between apartments. I'm back and I'm ready to spend some time.
> 
> I gave up detecting the language of a query. It is very possible indeed
> and I got great results with Weka, but it takes too much time: 5-50 seconds
> on my Pentium M.
> 
> However, I'm still working on the "autoanalytic stemmer", at least in my
> head. I've started to feed my index with documents tagged with the
> language in a field, and thought it should analyze (still the n-gram
> approach) all words of a specific language to find stemming rules for
> each and every language. The output can be used for per-language stemming,
> BUT hopefully I'll be able to use this data to create my generic
> stemmer.
> 
> The language models and inflectional form extraction should be based on
> the index content, but I can't seem to find out how to access the terms
> of a specific set of documents. Of course, I could just query my index
> and start working on the data, building my own trie-pattern, but I'm 
> sure I don't have to.

Please take a look at http://www.egothor.org and its stemmer package -
it does exactly this, and it's based on solid research... :-) In my
experience, the stemmers built with this package work exceptionally
well, even for complex inflection-rich languages like the Slavic family.

However, you always need to know the language of the document in advance
- my belief is that it's impossible to build a "universal stemmer good
for any language".

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)





Re: N-gram layer

Posted by karl wettin <ka...@snigel.dnsalias.net>.
On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
Otis Gospodnetic <ot...@yahoo.com> wrote:

> Looking forward to the contribution.

Sorry for the delay, but I've had quite some workload lately, and then I
moved between apartments. I'm back and I'm ready to spend some time.

I gave up detecting the language of a query. It is very possible indeed
and I got great results with Weka, but it takes too much time: 5-50 seconds
on my Pentium M.

However, I'm still working on the "autoanalytic stemmer", at least in my
head. I've started to feed my index with documents tagged with the
language in a field, and thought it should analyze (still the n-gram
approach) all words of a specific language to find stemming rules for
each and every language. The output can be used for per-language stemming,
BUT hopefully I'll be able to use this data to create my generic
stemmer.

The language models and inflectional form extraction should be based on
the index content, but I can't seem to find out how to access the terms
of a specific set of documents. Of course, I could just query my index
and start working on the data, building my own trie-pattern, but I'm 
sure I don't have to.

I've been browsing the list archives and API for several days without
finding out how to iterate the (distinct/unique) terms of the index
or a specific set of documents. 

How do I do that? 



-- 

karl



Okapi

Posted by Wisam Dakka <wi...@cs.columbia.edu>.
Hi All,

	I am new here. I am planning to add an Okapi score to Lucene. Any idea
where I should start?
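[One natural starting point is the weighting formula itself; the sketch below is just the textbook Okapi BM25 term weight (illustrative code, not Lucene's), where k1 and b are the usual free parameters, typically around 1.2 and 0.75. Plugging it into Lucene would then mean adapting the scoring machinery to use it.

```java
// Textbook Okapi BM25 weight for one term in one document (sketch).
public class OkapiBM25 {

    // tf: term frequency in the document; df: document frequency of the
    // term; numDocs: collection size; docLen/avgDocLen: length
    // normalization; k1, b: tuning parameters.
    public static double score(int tf, int df, int numDocs,
                               double docLen, double avgDocLen,
                               double k1, double b) {
        double idf = Math.log((numDocs - df + 0.5) / (df + 0.5));
        double norm = k1 * ((1 - b) + b * docLen / avgDocLen);
        return idf * tf * (k1 + 1) / (tf + norm);
    }
}
```

The score grows with tf but saturates, and rare terms (low df) weigh more — the two properties that distinguish BM25 from plain tf-idf — Ed.]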

Thanks





Re: N-gram layer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 2, 2004, at 2:10 PM, Jean-Francois Halleux wrote:
> Hi Karl,
>
> 	during the past days, I've developed such a language guesser myself 
> as a
> basis for a Lucene analyzer. It is based on trigrams. It is already 
> working
> but not yet in a "publishable" state. If you or others are interested 
> I can
> offer the sources.

Nice!  If you'd like your work to be part of the analyzers section, 
feel free to code it such that it will fit right in.  Any improvement 
in that area is more than welcome, test cases included :)

	Erik




RE: N-gram layer

Posted by Jean-Francois Halleux <ha...@skynet.be>.
Hi Karl,

	during the past days, I've developed such a language guesser myself as a
basis for a Lucene analyzer. It is based on trigrams. It is already working
but not yet in a "publishable" state. If you or others are interested I can
offer the sources.

KR,

Jean-Francois Halleux

-----Original Message-----
From: karl wettin [mailto:kalle@snigel.dnsalias.net]
Sent: dimanche 1 fevrier 2004 22:07
To: lucene-dev@jakarta.apache.org
Subject: N-gram layer



Hello list,

I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some things missing that I'd be happy to
contribute.

I started by writing a simple N-gram classifier to detect the language of
a text in order to automatically cluster documents by language. The
algorithm is very similar to the "TextCat" C library.

And then I thought: maybe it would be possible to use the same N-gram
classifier to make an automatic stemmer that works on all languages.
Hopefully I'll have something up and running for tests by next weekend.

The same classifier could be used for a simple metaphone index.

However, I need some help understanding the Analyzer. Where can I
find some tutorials on how to write my own? (I didn't check with Google;
maybe I should have before posting here.) Since the stemmer (and metaphone)
data would have to be indexed in their own field(s), querying the stemmed
data would require one to stem the query too. Can I create a subclass of
Query (or so), or do I need to create my own Query class that handles
the stemming all the way for the user? The last option is my current
approach, so I would appreciate some hints and pointers here.


Great project!


karl








RE: N-gram layer

Posted by Robert Engels <re...@ix.netcom.com>.
Actually, you do not always need to store it in a field.

See the Phonetic Query patch I posted (which does Soundex, Metaphone, and
can actually do any 'secondary' info query).

Robert Engels

-----Original Message-----
From: karl wettin [mailto:kalle@snigel.dnsalias.net]
Sent: Sunday, February 01, 2004 3:07 PM
To: lucene-dev@jakarta.apache.org
Subject: N-gram layer



Hello list,

I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some things missing that I'd be happy to
contribute.

I started by writing a simple N-gram classifier to detect the language of
a text in order to automatically cluster documents by language. The
algorithm is very similar to the "TextCat" C library.

And then I thought: maybe it would be possible to use the same N-gram
classifier to make an automatic stemmer that works on all languages.
Hopefully I'll have something up and running for tests by next weekend.

The same classifier could be used for a simple metaphone index.

However, I need some help understanding the Analyzer. Where can I
find some tutorials on how to write my own? (I didn't check with Google;
maybe I should have before posting here.) Since the stemmer (and metaphone)
data would have to be indexed in their own field(s), querying the stemmed
data would require one to stem the query too. Can I create a subclass of
Query (or so), or do I need to create my own Query class that handles
the stemming all the way for the user? The last option is my current
approach, so I would appreciate some hints and pointers here.


Great project!


karl







Re: N-gram layer

Posted by Otis Gospodnetic <ot...@yahoo.com>.
The best Analyzer documentation so far is Erik Hatcher's "Parser Rulez"
article. The link is under the Resources page on Lucene's site.

Looking forward to the contribution.

Otis


--- karl wettin <ka...@snigel.dnsalias.net> wrote:
> 
> Hello list,
> 
> I'm Karl, and I just started testing Lucene the other day. It's a great
> core engine, but I feel there are some things missing that I'd be happy to
> contribute.
> 
> I started by writing a simple N-gram classifier to detect the language of
> a text in order to automatically cluster documents by language. The
> algorithm is very similar to the "TextCat" C library.
> 
> And then I thought: maybe it would be possible to use the same N-gram
> classifier to make an automatic stemmer that works on all languages.
> Hopefully I'll have something up and running for tests by next weekend.
> 
> The same classifier could be used for a simple metaphone index.
> 
> However, I need some help understanding the Analyzer. Where can I
> find some tutorials on how to write my own? (I didn't check with Google;
> maybe I should have before posting here.) Since the stemmer (and metaphone)
> data would have to be indexed in their own field(s), querying the stemmed
> data would require one to stem the query too. Can I create a subclass of
> Query (or so), or do I need to create my own Query class that handles
> the stemming all the way for the user? The last option is my current
> approach, so I would appreciate some hints and pointers here.
> 
> 
> Great project! 
> 
> 
> karl
> 
> 
> 

