Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@syr.edu> on 2004/02/24 16:03:38 UTC

Porter Stemmer

Hi,

Is there any reason why the PorterStemmer can't be made public?  I know several people have submitted this patch, both separately and as part of other patches.  I, for one, am using it in other places as part of my overall search solution and I bet others are as well.  I guess I could understand if all stemmers were that way, but the GermanStemmer is publicly available, so it doesn't seem to be consistent.

Just wondering...

-Grant


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Kstem vs. Snowball? -- Re: Porter Stemmer

Posted by Mark Woon <mo...@helix.stanford.edu>.
David Spencer wrote:

> Out of curiosity can anyone comment on how Snowball compares with KStem,
> which appeared on the mailing list around this thread:
> http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg03740.html
>

Didn't see any responses to this question, and was wondering the same 
thing.  Can anyone with experience with these two stemmers comment?

Thanks,
-Mark



Kstem vs. Snowball? -- Re: Porter Stemmer

Posted by David Spencer <da...@tropo.com>.
Erik Hatcher wrote:

> On Feb 24, 2004, at 12:33 PM, Michael McGrady wrote:
> 
>> This conversation is a mystery to me.  Is there some different Porter 
>> stemmer than the one available in the Lucene source code?
> 
> 
> Yes.  As mentioned, the snowball analyzer family lives in the sandbox.  
> The CVS repository is jakarta-lucene-sandbox - look under 
> contributions/snowball for more details.  Dr. Porter's website contains 
> details on why he developed snowball over the original Porter stemmer.

Out of curiosity can anyone comment on how Snowball compares with KStem, 
which appeared on the mailing list around this thread:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg03740.html


Also, I thought I read somewhere about new stemmers existing that can 
return multiple stems for a word - but on examination neither KStem nor 
Snowball seem to fit this description. Memory fault?



> 
>     Erik
> 
>>
>> At 09:03 AM 2/24/2004, you wrote:
>>
>>> On Feb 24, 2004, at 10:03 AM, Grant Ingersoll wrote:
>>>
>>>> Is there any reason why the PorterStemmer can't be made public?  I 
>>>> know several people have submitted this patch, both separately and 
>>>> as part of other patches.  I, for one, am using it in other places 
>>>> as part of my overall search solution and I bet others are as well.  
>>>> I guess I could understand if all stemmers were that way, but the 
>>>> GermanStemmer is publicly available, so it doesn't seem to be 
>>>> consistent.
>>>>
>>>> Just wondering...
>>>
>>>
>>> I think we can make it public.  But an alternative is to use the 
>>> snowball code in the sandbox, which has a public PorterStemmer.
> 
> 
> 
> 




Re: Porter Stemmer

Posted by Michael McGrady <mi...@michaelmcgrady.com>.
For those interested: 
http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

At 11:25 AM 2/24/2004, you wrote:
>On Feb 24, 2004, at 12:33 PM, Michael McGrady wrote:
>>This conversation is a mystery to me.  Is there some different Porter 
>>stemmer than the one available in the Lucene source code?
>
>Yes.  As mentioned, the snowball analyzer family lives in the sandbox.
>The CVS repository is jakarta-lucene-sandbox - look under 
>contributions/snowball for more details.  Dr. Porter's website contains 
>details on why he developed snowball over the original Porter stemmer.
>
>         Erik
>
>>
>>At 09:03 AM 2/24/2004, you wrote:
>>>On Feb 24, 2004, at 10:03 AM, Grant Ingersoll wrote:
>>>>Is there any reason why the PorterStemmer can't be made public?  I know 
>>>>several people have submitted this patch, both separately and as part 
>>>>of other patches.  I, for one, am using it in other places as part of 
>>>>my overall search solution and I bet others are as well.
>>>>I guess I could understand if all stemmers were that way, but the 
>>>>GermanStemmer is publicly available, so it doesn't seem to be consistent.
>>>>
>>>>Just wondering...
>>>
>>>I think we can make it public.  But an alternative is to use the 
>>>snowball code in the sandbox, which has a public PorterStemmer.
>
>
>
>





Re: RE : RE : BooleanQuery/Clauses with Linked Filters

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 25, 2004, at 11:51 AM, Rasik Pandey wrote:
> Our particular case concerns adding or doing a "date query" for which 
> we are using a DateFilter. We were using a DateFilter to represent one 
> of the nested sub-queries, but since that is only possible at the same 
> level of a complex parent BooleanQuery, this obviously would affect 
> the results globally instead of respecting the nesting.  Maybe the 
> answer is the RangeQuery? I need to research that....

Yeah, RangeQuery sounds like the right thing to do in your case.

> In general, what is the performance gain, if any, when using a Filter 
> vs. adding an extra BooleanClause to retrieve the same results?

I'm not sure it can be generalized - it would depend on the query and 
the filter you're comparing.  But generally a Filter has upfront work 
to do to create the filter bit set.  For a DateFilter, this involves 
enumerating the terms in the range.  The QueryFilter performs an actual 
query, so it would be dependent on what it needed to do.

Filters are good when you can create them and let them live over the 
course of multiple queries.

	Erik
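
The trade-off described above can be sketched with a toy model. This is not 
Lucene's Filter API: the class and method names below are hypothetical, 
documents are modeled as an array of yyyymmdd integers, and a query's hits 
are modeled as a java.util.BitSet of document ids. The point is only that 
the bit set costs a full pass to build once, and each later query pays just 
for an intersection.

```java
import java.util.BitSet;

// Toy model only: hypothetical names, not Lucene's Filter or DateFilter classes.
class CachedFilterSketch {

    // Upfront work: walk every document once and set a bit for each
    // document whose date falls inside [from, to].
    static BitSet buildDateBits(int[] docDates, int from, int to) {
        BitSet bits = new BitSet(docDates.length);
        for (int doc = 0; doc < docDates.length; doc++) {
            if (docDates[doc] >= from && docDates[doc] <= to) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Per-query work: intersect the query's hits with the prebuilt bits.
    // The inputs are left unmodified so the filter bits can be reused.
    static BitSet applyFilter(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }

    // Convenience helper to build a set of document ids.
    static BitSet bitsOf(int... docs) {
        BitSet bits = new BitSet();
        for (int doc : docs) {
            bits.set(doc);
        }
        return bits;
    }

    public static void main(String[] args) {
        // The array index is the document id; the value is the document date.
        int[] docDates = {20040101, 20040215, 20040224, 20031130, 20040225};

        // Built once, then reused across several queries.
        BitSet feb2004 = buildDateBits(docDates, 20040201, 20040229);

        System.out.println(applyFilter(bitsOf(0, 1, 2), feb2004)); // prints {1, 2}
        System.out.println(applyFilter(bitsOf(2, 3, 4), feb2004)); // prints {2, 4}
    }
}
```

If the bit set were rebuilt for every query, the upfront scan would be paid 
each time, which is why a filter that lives for only one query may not beat 
an extra clause.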




RE : RE : BooleanQuery/Clauses with Linked Filters

Posted by Rasik Pandey <ra...@ajlsm.com>.
> It's still not entirely clear, but it seems you could accomplish what
> you want by putting in some AND TermQuery's in there instead of trying
> to use a Filter.  Wouldn't that do what you want?

OK, so the example wasn't as concrete as it should have been, and you are right that I could add more sub-queries to get the same effect.

Our particular case concerns adding or doing a "date query" for which we are using a DateFilter. We were using a DateFilter to represent one of the nested sub-queries, but since that is only possible at the same level of a complex parent BooleanQuery, this obviously would affect the results globally instead of respecting the nesting.  Maybe the answer is the RangeQuery? I need to research that....

In general, what is the performance gain, if any, when using a Filter vs. adding an extra BooleanClause to retrieve the same results?


RBP 





Re: RE : BooleanQuery/Clauses with Linked Filters

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 25, 2004, at 9:39 AM, Rasik Pandey wrote:
>> On Feb 25, 2004, at 6:48 AM, Rasik Pandey wrote:
>>> I was wondering if it is somehow possible to tie filters to
>>> individual boolean queries/clauses. I didn't see any such
>>> possibility with a quick inspection of the code and I am not sure
>>> if a special HitCollector implementation would be sufficient or
>>> even merited. Any ideas, suggestions, or comments would be
>>> appreciated.
>>
>> I'm not following what you mean.  Could you describe a concrete
>> example of what you're after?
>
> Searching Lucene with a BooleanQuery using Searcher.search(Query 
> query, Filter filter) does not allow for coupling filters to 
> sub-queries of the BooleanQuery, but rather coupling a filter to the 
> parent BooleanQuery.

Right - you cannot apply filters to subqueries.  A filter is really a 
pre-query screen, not really part of the query itself.

> Currently we can do this:
>
> BooleanQuery not Filter(state, texas) not Filter(country, georgia)
> -TermQuery(city, paris)
> OR
> -TermQuery(state, georgia)
>
>
> instead we would like to be able to do this.
> BooleanQuery
> -TermQuery(city, paris) not Filter(state, texas)
> OR
> -TermQuery(state, georgia) not Filter(country, georgia)
>
>
> Let me know if this isn't clear.

It's still not entirely clear, but it seems you could accomplish what 
you want by putting in some AND TermQuery's in there instead of trying 
to use a Filter.  Wouldn't that do what you want?

	Erik
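
The difference between the two shapes in the quoted example, and the 
suggestion to fold the exclusions into each sub-query as ordinary clauses, 
can be sketched with a toy model. This is not Lucene code: documents are 
plain maps, each "query" is a java.util.function.Predicate, and every name 
below is hypothetical.

```java
import java.util.Map;
import java.util.function.Predicate;

// Toy model only: hypothetical names, not Lucene's BooleanQuery or TermQuery.
class NestedExclusionSketch {

    // Stand-in for a term clause: matches when the field holds the value.
    static Predicate<Map<String, String>> term(String field, String value) {
        return doc -> value.equals(doc.get(field));
    }

    // Exclusions attached to the parent, as a top-level filter would be:
    // (city:paris OR state:georgia) AND NOT state:texas AND NOT country:georgia
    static Predicate<Map<String, String>> globalExclusions() {
        return term("city", "paris").or(term("state", "georgia"))
                .and(term("state", "texas").negate())
                .and(term("country", "georgia").negate());
    }

    // Exclusions expressed as clauses inside each sub-query:
    // (city:paris AND NOT state:texas) OR (state:georgia AND NOT country:georgia)
    static Predicate<Map<String, String>> nestedExclusions() {
        Predicate<Map<String, String>> left =
                term("city", "paris").and(term("state", "texas").negate());
        Predicate<Map<String, String>> right =
                term("state", "georgia").and(term("country", "georgia").negate());
        return left.or(right);
    }

    public static void main(String[] args) {
        // A Paris document whose country is Georgia: the second sub-query's
        // exclusion should not apply to it, because it matched the first.
        Map<String, String> doc =
                Map.of("city", "paris", "state", "none", "country", "georgia");

        System.out.println(globalExclusions().test(doc)); // prints false
        System.out.println(nestedExclusions().test(doc)); // prints true
    }
}
```

The sample document is the case where the two shapes disagree: with 
parent-level exclusions it is dropped, while with per-sub-query exclusions 
it is kept.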




RE : BooleanQuery/Clauses with Linked Filters

Posted by Rasik Pandey <ra...@ajlsm.com>.
> On Feb 25, 2004, at 6:48 AM, Rasik Pandey wrote:
> > I was wondering if it is somehow possible to tie filters to
> > individual boolean queries/clauses. I didn't see any such
> > possibility with a quick inspection of the code and I am not sure
> > if a special HitCollector implementation would be sufficient or
> > even merited. Any ideas, suggestions, or comments would be
> > appreciated.
> 
> I'm not following what you mean.  Could you describe a concrete
> example of what you're after?

Searching Lucene with a BooleanQuery using Searcher.search(Query query, Filter filter) does not allow for coupling filters to sub-queries of the BooleanQuery, but rather coupling a filter to the parent BooleanQuery.


Currently we can do this:

BooleanQuery not Filter(state, texas) not Filter(country, georgia)
-TermQuery(city, paris) 
OR
-TermQuery(state, georgia)


instead we would like to be able to do this.
BooleanQuery 
-TermQuery(city, paris) not Filter(state, texas)
OR
-TermQuery(state, georgia) not Filter(country, georgia)


Let me know if this isn't clear.

RBP





Re: BooleanQuery/Clauses with Linked Filters

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 25, 2004, at 6:48 AM, Rasik Pandey wrote:
> I was wondering if it is somehow possible to tie filters to individual 
> boolean queries/clauses. I didn't see any such possibility with a 
> quick inspection of the code and I am not sure if a special 
> HitCollector implementation would be sufficient or even merited. Any 
> ideas, suggestions, or comments would be appreciated.

I'm not following what you mean.  Could you describe a concrete example 
of what you're after?





BooleanQuery/Clauses with Linked Filters

Posted by Rasik Pandey <ra...@ajlsm.com>.
Hello,

I was wondering if it is somehow possible to tie filters to individual boolean queries/clauses. I didn't see any such possibility with a quick inspection of the code and I am not sure if a special HitCollector implementation would be sufficient or even merited. Any ideas, suggestions, or comments would be appreciated.

Thanks,
RBP





Re: Another website using Lucene

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 25, 2004, at 3:21 AM, Paul Kavanagh wrote:
> I have the 'Powered By Lucene' text on the search results page. Any
> chance someone could add my site to the list of sites on the Powered
> By page?
>
> Here's the link:
> http://dublin.citycollective.com

Done!  (I added it to the files in CVS - the site update will happen at 
some point in the future)

	Erik




Another website using Lucene

Posted by Paul Kavanagh <p_...@yahoo.com>.
Hi guys,

this is a shameless plug for another website whose search functionality is
driven by Lucene. Thanks to all for such a quality and easy to use product.
I have recommended it to many.

I have the 'Powered By Lucene' text on the search results page. Any chance
someone could add my site to the list of sites on the Powered By page ?

Here's the link:
http://dublin.citycollective.com

Cheers,
-Paul

_______________________________
Paul Kavanagh
Founder
Dublin City Collective

email: paul@citycollective.com
web: http://dublin.citycollective.com




Re: Porter Stemmer

Posted by Michael McGrady <mi...@michaelmcgrady.com>.
Thanks, we learn something new on all lucky days.

At 11:25 AM 2/24/2004, you wrote:
>On Feb 24, 2004, at 12:33 PM, Michael McGrady wrote:
>>This conversation is a mystery to me.  Is there some different Porter 
>>stemmer than the one available in the Lucene source code?
>
>Yes.  As mentioned, the snowball analyzer family lives in the sandbox.
>The CVS repository is jakarta-lucene-sandbox - look under 
>contributions/snowball for more details.  Dr. Porter's website contains 
>details on why he developed snowball over the original Porter stemmer.
>
>         Erik
>
>>
>>At 09:03 AM 2/24/2004, you wrote:
>>>On Feb 24, 2004, at 10:03 AM, Grant Ingersoll wrote:
>>>>Is there any reason why the PorterStemmer can't be made public?  I know 
>>>>several people have submitted this patch, both separately and as part 
>>>>of other patches.  I, for one, am using it in other places as part of 
>>>>my overall search solution and I bet others are as well.
>>>>I guess I could understand if all stemmers were that way, but the 
>>>>GermanStemmer is publicly available, so it doesn't seem to be consistent.
>>>>
>>>>Just wondering...
>>>
>>>I think we can make it public.  But an alternative is to use the 
>>>snowball code in the sandbox, which has a public PorterStemmer.
>
>
>
>





Re: Porter Stemmer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 24, 2004, at 12:33 PM, Michael McGrady wrote:
> This conversation is a mystery to me.  Is there some different Porter 
> stemmer than the one available in the Lucene source code?

Yes.  As mentioned, the snowball analyzer family lives in the sandbox.  
The CVS repository is jakarta-lucene-sandbox - look under 
contributions/snowball for more details.  Dr. Porter's website contains 
details on why he developed snowball over the original Porter stemmer.

	Erik

>
> At 09:03 AM 2/24/2004, you wrote:
>> On Feb 24, 2004, at 10:03 AM, Grant Ingersoll wrote:
>>> Is there any reason why the PorterStemmer can't be made public?  I 
>>> know several people have submitted this patch, both separately and 
>>> as part of other patches.  I, for one, am using it in other places 
>>> as part of my overall search solution and I bet others are as well.  
>>> I guess I could understand if all stemmers were that way, but the 
>>> GermanStemmer is publicly available, so it doesn't seem to be 
>>> consistent.
>>>
>>> Just wondering...
>>
>> I think we can make it public.  But an alternative is to use the 
>> snowball code in the sandbox, which has a public PorterStemmer.




Re: Porter Stemmer

Posted by Michael McGrady <mi...@michaelmcgrady.com>.
This conversation is a mystery to me.  Is there some different Porter 
stemmer than the one available in the Lucene source code?

At 09:03 AM 2/24/2004, you wrote:
>On Feb 24, 2004, at 10:03 AM, Grant Ingersoll wrote:
>>Is there any reason why the PorterStemmer can't be made public?  I know 
>>several people have submitted this patch, both separately and as part of 
>>other patches.  I, for one, am using it in other places as part of my 
>>overall search solution and I bet others are as well.  I guess I could 
>>understand if all stemmers were that way, but the GermanStemmer is 
>>publicly available, so it doesn't seem to be consistent.
>>
>>Just wondering...
>
>I think we can make it public.  But an alternative is to use the snowball 
>code in the sandbox, which has a public PorterStemmer.
>
>         Erik
>
>
>
>





Re: Porter Stemmer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 24, 2004, at 10:03 AM, Grant Ingersoll wrote:
> Is there any reason why the PorterStemmer can't be made public?  I 
> know several people have submitted this patch, both separately and as 
> part of other patches.  I, for one, am using it in other places as 
> part of my overall search solution and I bet others are as well.  I 
> guess I could understand if all stemmers were that way, but the 
> GermanStemmer is publicly available, so it doesn't seem to be 
> consistent.
>
> Just wondering...

I think we can make it public.  But an alternative is to use the 
snowball code in the sandbox, which has a public PorterStemmer.

	Erik

