Posted to dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2003/10/06 19:49:13 UTC

SnowballAnalyzer

At one point, I believe, it was proposed to bring the sandbox 
SnowballAnalyzer into the core.  Is this still desired or shall we just 
leave it in the sandbox?

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: SnowballAnalyzer

Posted by Leo Galambos <Le...@seznam.cz>.

Hi Pete,

IMHO you could also use stemmers which are 1) faster 2) more accurate 3) 
able to learn and process *any* language 4) able to work as 
lemmatiser/guesser. I know two algorithms which have all the properties:

The first one is based on Jan Daciuk's MFSA, and the second one is, ehm 
no self-promotion ;-), my method. The comparison of these two methods is 
here: http://www.egothor.org/temp/us-0E2-cmp.png (English dictionary)

My method was designed for IR systems, so it gives better accuracy in 
such environments. I was also interested in compound words (e.g. German), 
so I can offer you a multilevel stemmer which does the job. Elsewhere 
you may get better results with Jan's method.

Leo


Re: SnowballAnalyzer

Posted by Pete Lewis <pe...@uptima.co.uk>.

Hi all

I know that I have no vote but I think that it would be wrong to bring the SnowballAnalyzer into the core.

There are some distinct limitations with this purely algorithmic approach.  Yes, it would be great to say 'hey, we have 14 languages covered', but you should first realise the limitations of the product.  Let's start with some definitions.

'Stemming' signifies the process of finding the stems in words. 'Lemmatisation' is the process of reducing the word form to its 'lemma' form, i.e. the form one expects to find in a dictionary. The differences are:

1. In many languages the dictionary form is not the stem. E.g. in Dutch the infinitive of a verb is not its stem.

2. Words may have several stems due to composition (common in Germanic languages).

The terms are both used extremely loosely in the literature, where they often indicate the same thing.



A tool often used for English is the Porter stemmer. Strictly speaking, it is neither a stemmer nor a lemmatiser; it cuts off certain characters on the basis of the characters before them. In many cases morphologically equivalent forms reduce to the same root form. There have been efforts to create similar algorithmic tools for other languages. Porter has lately designed a language called Snowball for writing scripts that perform these reductions. Snowball has been applied to a number of languages, and in many cases these scripts are publicly available. Snowball is not capable of handling composition, nor is it capable of handling other more demanding morphological patterns, such as agglutination and infixes.



Basically, people would expect the terms in a search query to be reduced to the same root form as that used for indexing, so that a search finds the different derivations of a term (plurals etc.).
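
To make that expectation concrete, here is a minimal Java sketch. It does not use Lucene: TinyIndex, stripPlural and searchRaw are hypothetical names invented for illustration, and the normalizer is a deliberately crude stand-in for a real stemmer. It only shows that a query term matches when it goes through the same reduction as the indexed terms, and misses when it does not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Hypothetical toy inverted index; not part of Lucene.
final class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final UnaryOperator<String> normalizer;

    TinyIndex(UnaryOperator<String> normalizer) {
        this.normalizer = normalizer;
    }

    // Index time: reduce every whitespace-separated token before storing it.
    void add(int docId, String text) {
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(normalizer.apply(token), k -> new TreeSet<>()).add(docId);
        }
    }

    // Query time: apply the same reduction to the query term.
    Set<Integer> search(String term) {
        return postings.getOrDefault(normalizer.apply(term), new TreeSet<>());
    }

    // Lookup that skips the reduction, to show the mismatch.
    Set<Integer> searchRaw(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // Assumed stand-in for a stemmer: drops one trailing "s" from longer words.
    static String stripPlural(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex(TinyIndex::stripPlural);
        index.add(1, "snowball analyzers");
        System.out.println(index.search("analyzer"));     // reduced query
        System.out.println(index.search("analyzers"));    // reduced query
        System.out.println(index.searchRaw("analyzers")); // unreduced query
    }
}
```

The first two lookups find document 1 because the query is reduced to the indexed form "analyzer"; the raw lookup finds nothing because the index never stored "analyzers".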



Some examples from Snowball should speak for themselves:

bus -> bus
buses -> buse
catch -> catch
caught -> caught
manage -> manag
management -> manag



These show incorrect handling of plurals and irregular forms, and the mixing of verbs and nouns.  Obviously many other examples can be found.
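
The results listed above can be imitated with a toy suffix stripper. The sketch below is not the Snowball English stemmer and not Porter's algorithm: the class name ToySuffixStripper and its three rules are assumptions made up for illustration, chosen only so that the six words above come out the same way. It shows why rules that look only at trailing characters leave an irregular form such as "caught" untouched and merge "manage" with "management".

```java
// Hypothetical toy stripper; the rules are illustrative, not Snowball's.
final class ToySuffixStripper {

    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // True if any character other than the last one is a vowel.
    private static boolean hasVowelBeforeLast(String s) {
        for (int i = 0; i < s.length() - 1; i++) {
            if (isVowel(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Expects a lower-case word; applies three suffix rules in order.
    static String stem(String word) {
        String w = word;

        // Rule 1: drop a final "s" (but not "ss") when the remainder has a
        // vowel somewhere before its last character.
        if (w.endsWith("s") && !w.endsWith("ss")) {
            String rest = w.substring(0, w.length() - 1);
            if (hasVowelBeforeLast(rest)) {
                w = rest;
            }
        }

        // Rule 2: drop "ement" when at least four characters remain.
        if (w.endsWith("ement") && w.length() - 5 >= 4) {
            w = w.substring(0, w.length() - 5);
        }

        // Rule 3: drop a final "e" when at least four characters remain.
        if (w.endsWith("e") && w.length() - 1 >= 4) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        String[] words = {"bus", "buses", "catch", "caught", "manage", "management"};
        for (String word : words) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

Nothing in these rules knows that "caught" belongs with "catch" or that "manage" is a verb and "management" a noun; handling that needs a dictionary or a learned model rather than more suffix rules.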



While this isn't too bad for English it gets pretty dire for other languages.



For English I'd prefer KStem rather than Snowball.



Cheers



Pete

Re: SnowballAnalyzer

Posted by Mark Woon <mo...@helix.stanford.edu>.

Erik Hatcher wrote:

> At one point, I believe, it was proposed to bring the sandbox
> SnowballAnalyzer into the core.
>

+1!


-Mark

Re: SnowballAnalyzer

Posted by Otis Gospodnetic <ot...@yahoo.com>.

I don't know if I replied to this or not.  My opinion below.

--- Erik Hatcher <er...@ehatchersolutions.com> wrote:
> On Tuesday, October 7, 2003, at 05:25  AM, Otis Gospodnetic wrote:
> > My vote goes to leaving it in the Sandbox, for the same reasons I
> > mentioned the other day for some other similar component.
> >
> > As a matter of fact, I have been wondering if we should move Russian
> > and German code out of the core into the Sandbox.
> 
> I would be +1 on moving it out too.  But where do you draw the line
> on what Analyzers go in the core, then?

I would keep only the 'core' ones (Whitespace/Simple/Standard), even if
they have English-specific code in them.  I hate assuming English as
THE language, even though it is THE language in practice, but I don't
see a better way of keeping language-specific code out of the core.  In
my opinion, an ideal setup would be to keep the W/S/S in the core, and
all others in a Sandbox.  Analyzers in the Sandbox would be nicely
organized and would be in a stable state, so a simple 'ant jar' can
package everything up and let the developer just move the created Jar
to the appropriate directory in his environment.
I would keep those contributed Analyzers separate from the Snowball
ones, so their origins, etc. are clear.

I have a Brazilian Portuguese Analyzer sitting in the queue (read: my
email account's inbox), and I think I even have some code that somebody
sent for Chinese, other than CJK support from Che Dong.  This has been
waiting for my free time for months now... :(

Otis

Re: SnowballAnalyzer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Tuesday, October 7, 2003, at 05:25  AM, Otis Gospodnetic wrote:
> My vote goes to leaving it in the Sandbox, for the same reasons I
> mentioned the other day for some other similar component.
>
> As a matter of fact, I have been wondering if we should move Russian
> and German code out of the core into the Sandbox.

I would be +1 on moving it out too.  But where do you draw the line on 
what Analyzers go in the core, then?



Re: SnowballAnalyzer

Posted by Otis Gospodnetic <ot...@yahoo.com>.

My vote goes to leaving it in the Sandbox, for the same reasons I
mentioned the other day for some other similar component.

As a matter of fact, I have been wondering if we should move Russian
and German code out of the core into the Sandbox.

Otis
