Posted to dev@lucene.apache.org by "PROYECTA.Fernandez Garcia, Ivan" <pr...@iberia.es> on 2004/10/29 10:04:47 UTC

Question about PorterStemFilter class

Good morning everybody,

	We are using it in our Analyzer class and we have the following
questions:
		1º Why does it change the 'y' character to 'i' when we use
the parser method?
		    For instance: study -> studi
		2º In our case, Lucene finds 50 hits but only the first one
is shown.
		    If I comment out new PorterStemFilter(ts) in our Analyzer
class, all 50 hits are shown. Why?

Thanks very much for your help.

> Iván Fernández García
> Proyecta Sistemas de Información


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Question about PorterStemFilter class

Posted by Brian Goetz <br...@quiotix.com>.
The PorterStemmer class derives from Martin Porter's original 
implementation, with some bugs fixed (by me.)  I deliberately made the 
minimal modifications to simplify merging with future versions of the 
Porter stemmer (of which there were none.)

> I just this week joined the mailing list, and on this topic thought
> I'd mention that I've rewritten the PorterStemmer Java class, cleaning
> up whitespace and predeclaring all the Strings for better performance.
> It passes the file-in file-out test provided by Martin Porter (iow,
> no change from the existing algorithm). The source for mine was taken
> from his site -- I'm not sure of the origin of the one in Lucene. I
> could also add an Apache license to the top.




Question on showing excerpts

Posted by Murray Altheim <m....@open.ac.uk>.
[I'm gathering that the consensus on the PorterStemmer is to deprecate
the existing one in favour of using Snowball, so I've dropped the issue.]

Another question: the Lucene FAQ includes this entry:

  35. How can I show excerpts with the hit results? How about
      highlighting the matched words?
  http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q35

I'm interested in showing excerpts if the amount of effort isn't
enormous, and while I understand that for each document type the
results will be different, what I'm wondering is how to locate the
offsets within each search result that indicate the locations of
each hit within the searched document, so that I won't have to
duplicate Lucene's existing efforts in creating the excerpt.

Where might I find in the Lucene API or code the hooks I need?
Is this information readily available, or is it buried within
the engine or the index?

Sorry if this is obvious -- I couldn't locate it. And thanks
very much for any assistance (and no rush at all...).

Murray

......................................................................
Murray Altheim                    http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .

   The Rise of Pseudo Fascism -- David Neiwert
   Part 1: The Morphing of the Conservative Movement
     http://dneiwert.blogspot.com/2004_09_19_dneiwert_archive.html#109028353137888956
   Part 2: The Architecture of Fascism
     http://dneiwert.blogspot.com/2004_09_26_dneiwert_archive.html#109563628314780505
   Part 3: The Pseudo-Fascist Campaign
     http://dneiwert.blogspot.com/2004_10_03_dneiwert_archive.html#109596147171278590
   Part 4: The Apocalyptic One-Party State
     http://dneiwert.blogspot.com/2004_10_10_dneiwert_archive.html#109694976530359103
   Part 5: Warfare By Other Means
     http://dneiwert.blogspot.com/2004_10_17_dneiwert_archive.html#109755467135245579
   Part 6: Breaking Down the Barriers
     http://dneiwert.blogspot.com/2004_10_24_dneiwert_archive.html#109858062597237163
   Part 7: It Can Happen Here
     http://dneiwert.blogspot.com/2004_10_31_dneiwert_archive.html#109902109250035295



Re: Question about PorterStemFilter class

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Dr. Porter even recommends going to Snowball, so we should probably  
deprecate PorterStemFilter in favor of Snowball.  I've just added this  
as an item on the Lucene2Whiteboard wiki page.

However, so as to not lose your changes - feel free to add them to a  
Bugzilla issue so that your code can stick around in case anyone else  
is interested in picking it up.

	Erik

On Nov 4, 2004, at 3:58 AM, Murray Altheim wrote:

> Brian Goetz wrote:
>> The original (Porter) version did indeed not correspond to any  
>> current Java coding standards -- because (a) it was translated to  
>> Java before such standards emerged (JDK 1.0) and (b) translated from  
>> another language.  I left these as is, rather than attempt to clean  
>> them up.
>>> Yes, I understand that. In looking at the code it is quite obviously
>>> of the same origin (i.e. from Martin Porter) but may be an older
>>> version (I couldn't quite tell). It doesn't come close to any normal
>>> Java programming guidelines, and while I didn't try to modify the
>>> latest version so that it would, entirely, because part of the
>>> reason I think whoever wrote the original code formatted it the way
>>> they did was to enable an easier reading of the comparison strings.
>>> I've tried to come to some compromise, as well as (as I mentioned)
>>> predeclaring the Strings so that the VM isn't constantly recreating
>>> them.
>
> Brian,
>
> Thanks -- understood (both messages). With code as complicated to
> read as the PorterStemmer, I can understand the reasons. In the
> end, is there any desire for me to post the updated class I've
> created, or should I assume that it will be deprecated in favour
> of the Snowball code? If there's no need for my version, I'm
> fine with that, I just wouldn't want to dump it for no reason,
> nor push insertion of it into Lucene against anyone's wishes.
>
> Cheers,
>
> Murray




Re: Question about PorterStemFilter class

Posted by Murray Altheim <m....@open.ac.uk>.
Brian Goetz wrote:
> The original (Porter) version did indeed not correspond to any current 
> Java coding standards -- because (a) it was translated to Java before 
> such standards emerged (JDK 1.0) and (b) translated from another 
> language.  I left these as is, rather than attempt to clean them up.
> 
>>Yes, I understand that. In looking at the code it is quite obviously
>>of the same origin (i.e. from Martin Porter) but may be an older
>>version (I couldn't quite tell). It doesn't come close to any normal
>>Java programming guidelines, and while I didn't try to modify the
>>latest version so that it would, entirely, because part of the
>>reason I think whoever wrote the original code formatted it the way
>>they did was to enable an easier reading of the comparison strings.
>>I've tried to come to some compromise, as well as (as I mentioned)
>>predeclaring the Strings so that the VM isn't constantly recreating
>>them.

Brian,

Thanks -- understood (both messages). With code as complicated to
read as the PorterStemmer, I can understand the reasons. In the
end, is there any desire for me to post the updated class I've
created, or should I assume that it will be deprecated in favour
of the Snowball code? If there's no need for my version, I'm
fine with that, I just wouldn't want to dump it for no reason,
nor push insertion of it into Lucene against anyone's wishes.

Cheers,

Murray



Re: Question about PorterStemFilter class

Posted by Brian Goetz <br...@quiotix.com>.
The original (Porter) version did indeed not correspond to any current 
Java coding standards -- because (a) it was translated to Java before 
such standards emerged (JDK 1.0) and (b) translated from another 
language.  I left these as is, rather than attempt to clean them up.

> Yes, I understand that. In looking at the code it is quite obviously
> of the same origin (i.e. from Martin Porter) but may be an older
> version (I couldn't quite tell). It doesn't come close to any normal
> Java programming guidelines, and while I didn't try to modify the
> latest version so that it would, entirely, because part of the
> reason I think whoever wrote the original code formatted it the way
> they did was to enable an easier reading of the comparison strings.
> I've tried to come to some compromise, as well as (as I mentioned)
> predeclaring the Strings so that the VM isn't constantly recreating
> them.




Re: Question about PorterStemFilter class

Posted by Andi Vajda <an...@osafoundation.org>.
Actually, in the Snowball package there are a bunch of non-English stemmers:

   5045 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/DanishStemmer.class
   6786 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/DutchStemmer.class
  10857 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/EnglishStemmer.class
   9955 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/FinnishStemmer.class
  12833 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/FrenchStemmer.class
   5916 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/German2Stemmer.class
   5580 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/GermanStemmer.class
  12375 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/ItalianStemmer.class
  14522 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/KpStemmer.class
  21583 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/LovinsStemmer.class
   4170 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/NorwegianStemmer.class
   7916 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/PorterStemmer.class
  11766 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/PortugueseStemmer.class
   8318 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/RussianStemmer.class
  12239 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/SpanishStemmer.class
   4393 Tue Feb 24 10:55:20 PST 2004 net/sf/snowball/ext/SwedishStemmer.class

Andi..




Re: Question about PorterStemFilter class

Posted by Pete Lewis <pe...@uptima.co.uk>.
Snowball = Porter for English ;-)

Pete

----- Original Message ----- 
From: "Erik Hatcher" <er...@ehatchersolutions.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Friday, October 29, 2004 4:04 PM
Subject: Re: Question about PorterStemFilter class


> On Oct 29, 2004, at 10:39 AM, Murray Altheim wrote:
> > In short, I *think* I'm using a newer version of the code than the
> > one in the repository, plus I've cleaned it up.
> 
> Can you provide some tests that show differences in how it stems 
> between yours and the built-in one?
> 
> The PorterStemFilter is not used by any built-in Analyzers, so I 
> actually think we should move it out to the Analyzers Sandbox area or 
> deprecate it in favor of the Snowball stemmer.  Thoughts?
> 
> Erik


Re: Question about PorterStemFilter class

Posted by Murray Altheim <m....@open.ac.uk>.
Erik Hatcher wrote:
> On Oct 29, 2004, at 10:39 AM, Murray Altheim wrote:
> 
>>In short, I *think* I'm using a newer version of the code than the
>>one in the repository, plus I've cleaned it up.
> 
> 
> Can you provide some tests that show differences in how it stems 
> between yours and the built-in one?

Certainly, but the test I used is identical to the one that Martin
Porter provides: an input file and an output file. In running my
modified version against the provided output file (which is just a
long list of words), the output is identical. I made no algorithmic
changes to the code, only formatting and syntax-choice changes to
better conform to Java coding guidelines and the aforementioned
predeclaration of final Strings, which has no effect except for
performance.
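
The file-in, file-out comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name GoldenFileCheck, the method countMismatches and the placeholder stemmer are all made up for this example, and the word lists are held in memory rather than read from Martin Porter's test files. A real run would plug in the stemmer under test and load the two files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper: compares a stemmer's output against a list of expected stems.
public class GoldenFileCheck {

    // Returns how many words stem to something other than the expected value.
    public static int countMismatches(List<String> words,
                                      List<String> expected,
                                      UnaryOperator<String> stemmer) {
        if (words.size() != expected.size()) {
            throw new IllegalArgumentException("input and expected lists differ in length");
        }
        int mismatches = 0;
        for (int i = 0; i < words.size(); i++) {
            String actual = stemmer.apply(words.get(i));
            if (!actual.equals(expected.get(i))) {
                mismatches++;
                System.out.println(words.get(i) + ": expected " + expected.get(i)
                        + " but got " + actual);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for the input and output files.
        List<String> words = Arrays.asList("study", "studies");
        List<String> expected = Arrays.asList("studi", "studi");

        // Placeholder stemmer (an assumption, not the Porter algorithm):
        // it only handles the two endings needed for the sample words.
        UnaryOperator<String> placeholder = w ->
                w.endsWith("ies") ? w.substring(0, w.length() - 2)
              : w.endsWith("y")   ? w.substring(0, w.length() - 1) + "i"
              : w;

        System.out.println("mismatches: " + countMismatches(words, expected, placeholder));
    }
}
```

A result of zero mismatches over the full word list is what "no change from the existing algorithm" means in practice. Any reformatting that altered behaviour would show up as a non-zero count.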

The test files are identical to the ones on Martin Porter's web
page:

    http://www.tartarus.org/~martin/PorterStemmer/index.html

> The PorterStemFilter is not used by any built-in Analyzers, so I 
> actually think we should move it out to the Analyzers Sandbox area or 
> deprecate it in favor of the Snowball stemmer.  Thoughts?

None. As I mentioned, I'm new to this project and am not familiar
with the advantages of the Snowball stemmer. In reading through the
pages on SourceForge, e.g.,

    http://snowball.tartarus.org/texts/introduction.html

there are apparently pros and cons. But for myself, I'd leave it up
to those with more history in this project to make these kinds of
decisions.

Murray

......................................................................
Murray Altheim                    http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .

    [International terrorism] is a fantasy that has been exaggerated
    and distorted by politicians. It is a dark illusion that has
    spread unquestioned through governments around the world, the
    security services, and the international media. In an age when
    all the grand ideas have lost credibility, fear of a phantom
    enemy is all the politicians have left to maintain their power."

    The making of the terror myth, The Guardian
    http://www.guardian.co.uk/terrorism/story/0,12780,1327904,00.html



Re: Question about PorterStemFilter class

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 29, 2004, at 10:39 AM, Murray Altheim wrote:
> In short, I *think* I'm using a newer version of the code than the
> one in the repository, plus I've cleaned it up.

Can you provide some tests that show differences in how it stems 
between yours and the built-in one?

The PorterStemFilter is not used by any built-in Analyzers, so I 
actually think we should move it out to the Analyzers Sandbox area or 
deprecate it in favor of the Snowball stemmer.  Thoughts?

	Erik




Re: Question about PorterStemFilter class

Posted by Murray Altheim <m....@open.ac.uk>.
Erik Hatcher wrote:
> On Oct 29, 2004, at 10:09 AM, Otis Gospodnetic wrote:
> 
>>You should open a bug entry in Bugzilla and then attach your code to
>>it, with ASL on top.
> 
> 
> However, there is a PorterStemFilter built into Lucene.  Please compare 
> with that.
> 
> 	Erik

Erik,

Yes, I understand that. In looking at the code it is quite obviously
of the same origin (i.e. from Martin Porter) but may be an older
version (I couldn't quite tell). It doesn't come close to any normal
Java programming guidelines. I didn't try to make the latest version
conform entirely, because I think part of the reason whoever wrote the
original code formatted it that way was to make the comparison strings
easier to read.
I've tried to come to some compromise, as well as (as I mentioned)
predeclaring the Strings so that the VM isn't constantly recreating
them.

In short, I *think* I'm using a newer version of the code than the
one in the repository, plus I've cleaned it up.

Murray



Re: Question about PorterStemFilter class

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 29, 2004, at 10:09 AM, Otis Gospodnetic wrote:
> You should open a bug entry in Bugzilla and then attach your code to
> it, with ASL on top.

However, there is a PorterStemFilter built into Lucene.  Please compare 
with that.

	Erik


>
> Thanks,
> Otis
>
> --- Murray Altheim <m....@open.ac.uk> wrote:
>
>> Erik Hatcher wrote:
>>> On Oct 29, 2004, at 4:04 AM, PROYECTA.Fernandez Garcia, Ivan wrote:
>>>
>>>> 	We are using it in our Analyzer class and we have the following
>>>> questions:
>>>> 		1º Why does it change 'y' to 'i' character using parser
>>>> method?.
>>>> 		    Instance: study -> studi
>>>
>>>
>>> That's what stemmers do.  This allows queries for "study" and
>> "studies"
>>> to match the same documents, for example.
>>>
>>>
>>>> 		2º In our case, Lucene has searches 50 hits and is showed
>>>> the first one only.
>>>> 		    If I comment new PorterStemFilter(ts) from our Analyzer
>>>> class. All 50 hits is showed. Why?
>>>
>>> You haven't provided enough information.   Please provide a simple
>>> short example that shows one document (that currently does not get
>>> found) being indexed along with the code for your analyzer, along
>> with
>>> a sample query that should match but doesn't.
>>
>> Erik,
>>
>> I just this week joined the mailing list, and on this topic thought
>> I'd mention that I've rewritten the PorterStemmer Java class,
>> cleaning
>> up whitespace and predeclaring all the Strings for better
>> performance.
>> It passes the file-in file-out test provided by Martin Porter (iow,
>> no change from the existing algorithm). The source for mine was taken
>> from his site -- I'm not sure of the origin of the one in Lucene. I
>> could also add an Apache license to the top.
>>
>> What would I need to do to contribute this file? Just fill out the
>> ASF IP form and then commit the file in CVS?
>>
>> Thanks,
>>
>> Murray


Re: Question about PorterStemFilter class

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello Murray,

You should open a bug entry in Bugzilla and then attach your code to
it, with ASL on top.

Thanks,
Otis

--- Murray Altheim <m....@open.ac.uk> wrote:

> Erik Hatcher wrote:
> > On Oct 29, 2004, at 4:04 AM, PROYECTA.Fernandez Garcia, Ivan wrote:
> > 
> >>	We are using it in our Analyzer class and we have the following
> >>questions:
> >>		1º Why does it change 'y' to 'i' character using parser
> >>method?.
> >>		    Instance: study -> studi
> > 
> > 
> > That's what stemmers do.  This allows queries for "study" and
> "studies" 
> > to match the same documents, for example.
> > 
> > 
> >>		2º In our case, Lucene has searches 50 hits and is showed
> >>the first one only.
> >>		    If I comment new PorterStemFilter(ts) from our Analyzer
> >>class. All 50 hits is showed. Why?
> > 
> > You haven't provided enough information.   Please provide a simple 
> > short example that shows one document (that currently does not get 
> > found) being indexed along with the code for your analyzer, along
> with 
> > a sample query that should match but doesn't.
> 
> Erik,
> 
> I just this week joined the mailing list, and on this topic thought
> I'd mention that I've rewritten the PorterStemmer Java class,
> cleaning
> up whitespace and predeclaring all the Strings for better
> performance.
> It passes the file-in file-out test provided by Martin Porter (iow,
> no change from the existing algorithm). The source for mine was taken
> from his site -- I'm not sure of the origin of the one in Lucene. I
> could also add an Apache license to the top.
> 
> What would I need to do to contribute this file? Just fill out the
> ASF IP form and then commit the file in CVS?
> 
> Thanks,
> 
> Murray


Re: Question about PorterStemFilter class

Posted by Murray Altheim <m....@open.ac.uk>.
Erik Hatcher wrote:
> On Oct 29, 2004, at 4:04 AM, PROYECTA.Fernandez Garcia, Ivan wrote:
> 
>>	We are using it in our Analyzer class and we have the following
>>questions:
>>		1º Why does it change 'y' to 'i' character using parser
>>method?.
>>		    Instance: study -> studi
> 
> 
> That's what stemmers do.  This allows queries for "study" and "studies" 
> to match the same documents, for example.
> 
> 
>>		2º In our case, Lucene has searches 50 hits and is showed
>>the first one only.
>>		    If I comment new PorterStemFilter(ts) from our Analyzer
>>class. All 50 hits is showed. Why?
> 
> You haven't provided enough information.   Please provide a simple 
> short example that shows one document (that currently does not get 
> found) being indexed along with the code for your analyzer, along with 
> a sample query that should match but doesn't.

Erik,

I just this week joined the mailing list, and on this topic thought
I'd mention that I've rewritten the PorterStemmer Java class, cleaning
up whitespace and predeclaring all the Strings for better performance.
It passes the file-in file-out test provided by Martin Porter (iow,
no change from the existing algorithm). The source for mine was taken
from his site -- I'm not sure of the origin of the one in Lucene. I
could also add an Apache license to the top.

What would I need to do to contribute this file? Just fill out the
ASF IP form and then commit the file in CVS?

Thanks,

Murray



Re: Question about PorterStemFilter class

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 29, 2004, at 4:04 AM, PROYECTA.Fernandez Garcia, Ivan wrote:
> 	We are using it in our Analyzer class and we have the following
> questions:
> 		1º Why does it change 'y' to 'i' character using parser
> method?.
> 		    Instance: study -> studi

That's what stemmers do.  This allows queries for "study" and "studies" 
to match the same documents, for example.
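
To make the study -> studi case concrete, here is a minimal sketch of just two rules from the Porter algorithm: the plural endings of step 1a and the y -> i rule of step 1c. The class name StemSketch and its methods are invented for illustration. This is not the PorterStemmer class shipped with Lucene, and it leaves out every other step of the algorithm, so it should not be used for real indexing.

```java
// Simplified illustration of two Porter rules; NOT the full algorithm.
public class StemSketch {

    // A letter counts as a vowel if it is a, e, i, o, u,
    // or a 'y' that follows a consonant.
    static boolean isVowel(String w, int i) {
        char c = w.charAt(i);
        if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') {
            return true;
        }
        return c == 'y' && i > 0 && !isVowel(w, i - 1);
    }

    static boolean containsVowel(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (isVowel(s, i)) {
                return true;
            }
        }
        return false;
    }

    public static String stem(String word) {
        String w = word.toLowerCase();

        // Step 1a: plural endings (sses -> ss, ies -> i, ss -> ss, s -> "").
        if (w.endsWith("sses") || w.endsWith("ies")) {
            w = w.substring(0, w.length() - 2);
        } else if (!w.endsWith("ss") && w.endsWith("s")) {
            w = w.substring(0, w.length() - 1);
        }

        // Step 1c: a final 'y' becomes 'i' when the rest of the word has a vowel.
        if (w.endsWith("y") && containsVowel(w.substring(0, w.length() - 1))) {
            w = w.substring(0, w.length() - 1) + "i";
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("study"));    // studi
        System.out.println(stem("studies"));  // studi
    }
}
```

Because both forms reduce to the same term, a query for either one finds documents that contain the other. The stem does not have to be a real word; it only has to be the same on the indexing side and the query side.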

> 		2º In our case, Lucene has searches 50 hits and is showed
> the first one only.
> 		    If I comment new PorterStemFilter(ts) from our Analyzer
> class. All 50 hits is showed. Why?

You haven't provided enough information.   Please provide a simple 
short example that shows one document (that currently does not get 
found) being indexed along with the code for your analyzer, along with 
a sample query that should match but doesn't.

	Erik




Re: Question about PorterStemFilter class

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Redirecting to lucene-user list.

Hello,

1) That's part of the Porter stemming algorithm (see Dr. Porter's web
site for more info).  If you are indexing/searching Spanish text, you
probably shouldn't be using PorterStemFilter.

2) I can't tell from the description.  Could be one of those
unfortunate Field.Keyword and QueryParser interactions, where
Field.Keyword is not analyzed during indexing, but is analyzed during
searching.
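
A rough sketch of how that kind of mismatch loses hits, without using Lucene itself: the class KeywordMismatchSketch and its analyze method are hypothetical stand-ins, the "index" is just a set of terms, and analyze only lowercases and turns a trailing 'y' into 'i' to imitate a stemming analyzer. Whether this is what is happening in Ivan's case cannot be told from the description.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of an index term that skipped analysis while the query did not.
public class KeywordMismatchSketch {

    // Hypothetical stand-in for an analyzer with a stemming filter.
    public static String analyze(String text) {
        String t = text.toLowerCase();
        return t.endsWith("y") ? t.substring(0, t.length() - 1) + "i" : t;
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>();

        // Indexing side: the value is stored verbatim, the way an
        // un-analyzed keyword field would store it.
        index.add("study");

        // Query side: the same word goes through the analyzer first.
        String queryTerm = analyze("study");

        System.out.println(index.contains(queryTerm)); // false: "studi" is not in the index

        // If the value had been analyzed at indexing time, the terms would agree.
        index.add(analyze("study"));
        System.out.println(index.contains(queryTerm)); // true
    }
}
```

The general point is that the indexed term and the query term must pass through the same analysis, or the field must be left un-analyzed on both sides.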

Otis


--- "PROYECTA.Fernandez Garcia, Ivan" <pr...@iberia.es>
wrote:

> Good morning everybody,
> 
> 	We are using it in our Analyzer class and we have the following
> questions:
> 		1º Why does it change 'y' to 'i' character using parser
> method?.
> 		    Instance: study -> studi
> 		2º In our case, Lucene has searches 50 hits and is showed
> the first one only.
> 		    If I comment new PorterStemFilter(ts) from our Analyzer
> class. All 50 hits is showed. Why?
> 
> Thanks very much for your help.
> 
> > Iván Fernández García
> > Proyecta Sistemas de Información


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org