Posted to dev@lucene.apache.org by karl wettin <ka...@snigel.dnsalias.net> on 2004/03/12 02:54:22 UTC

Re: N-gram layer

On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
Otis Gospodnetic <ot...@yahoo.com> wrote:

> Looking forward to the contribution.

Sorry for the delay, but I've had a heavy workload lately, and then I
moved to a new apartment. I'm back and ready to spend some time on this.

I gave up on detecting the language of a query. It is certainly possible,
and I got great results with Weka, but it takes too much time: 5-50
seconds on my Pentium M.

However, I'm still working on the "autoanalytic stemmer", at least in my
head. I've started to feed my index with documents tagged with their
language in a field, and my idea is that it should analyze (still with
the n-gram approach) all words of a specific language to find stemming
rules for each and every language. The output can be used for
per-language stemming, BUT hopefully I'll be able to use this data to
create my generic stemmer.
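
A minimal sketch of the n-gram idea described above, in plain Java. It
is not Lucene code and not the algorithm of any existing stemmer: the
class name SuffixNgrams, the method countSuffixes, the sample words and
the choice to keep at least two characters as a stem are all
assumptions made for illustration. It tallies the trailing character
n-grams of the distinct words of each language, so that frequent
endings stand out as candidate inflectional suffixes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class SuffixNgrams {

    /**
     * Counts trailing character n-grams of length 1..maxLen over the
     * given distinct words. Each word contributes one count per suffix.
     */
    static Map<String, Integer> countSuffixes(Iterable<String> distinctWords, int maxLen) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : distinctWords) {
            // Assumption: leave at least two characters as a would-be stem.
            int longest = Math.min(maxLen, word.length() - 2);
            for (int n = 1; n <= longest; n++) {
                String suffix = word.substring(word.length() - n);
                counts.merge(suffix, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: distinct terms keyed by the value of a
        // language field. In practice these would come from the index.
        Map<String, List<String>> termsByLanguage = new HashMap<>();
        termsByLanguage.put("en",
                Arrays.asList("walking", "talking", "walked", "talked", "walks"));
        termsByLanguage.put("sv",
                Arrays.asList("bilen", "bilar", "bilarna", "husen"));

        // Print the suffix tally for each language.
        for (Map.Entry<String, List<String>> e : termsByLanguage.entrySet()) {
            System.out.println(e.getKey() + " -> " + countSuffixes(e.getValue(), 3));
        }
    }
}
```

Turning such counts into actual stemming rules would need more than
this, for example a frequency threshold and a check that the remaining
stem occurs with several different endings.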

The language models and inflectional form extraction should be based on
the index content, but I can't seem to find out how to access the terms
of a specific set of documents. Of course, I could just query my index
and start working on the data, building my own trie-pattern, but I'm 
sure I don't have to.

I've been browsing the list archives and the API for several days without
finding out how to iterate over the (distinct/unique) terms of the whole
index or of a specific set of documents.

How do I do that? 



-- 

karl

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: N-gram layer

Posted by Andrzej Bialecki <ab...@getopt.org>.

karl wettin wrote:
> On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
> Otis Gospodnetic <ot...@yahoo.com> wrote:
> 
> 
>>Looking forward to the contribution.
> 
> 
> Sorry for the delay, but I've had a heavy workload lately, and then I
> moved to a new apartment. I'm back and ready to spend some time on this.
> 
> I gave up on detecting the language of a query. It is certainly possible,
> and I got great results with Weka, but it takes too much time: 5-50
> seconds on my Pentium M.
> 
> However, I'm still working on the "autoanalytic stemmer", at least in my
> head. I've started to feed my index with documents tagged with their
> language in a field, and my idea is that it should analyze (still with
> the n-gram approach) all words of a specific language to find stemming
> rules for each and every language. The output can be used for
> per-language stemming, BUT hopefully I'll be able to use this data to
> create my generic stemmer.
> 
> The language models and inflectional form extraction should be based on
> the index content, but I can't seem to find out how to access the terms
> of a specific set of documents. Of course, I could just query my index
> and start working on the data, building my own trie-pattern, but I'm 
> sure I don't have to.

Please take a look at http://www.egothor.org, and its stemmer package -
it does exactly this, and it's based on solid research... :-) In my
experience, the stemmers built with this package work exceptionally
well, even for complex, inflection-rich languages like the Slavic family.

However, you always need to know the language of the document in advance
- my belief is that it's impossible to build a "universal stemmer good
for any language".

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
