Posted to java-user@lucene.apache.org by David Woodward <dw...@loc.gov> on 2009/01/17 00:20:08 UTC

Words that need protection from stemming, i.e., protwords.txt

Hi.

Any good protwords.txt out there?

In a fairly standard Solr analyzer chain, we use the English Porter stemming filter like so:

<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

For most purposes the Porter stemmer does just fine, but occasionally words come along that really don't work out too well, e.g.,

"maine" is stemmed to "main" - clearly goofing up precision about "Maine" without doing much good for variants of "main".
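
To make the role of the protected list concrete, here is a minimal sketch of the gating idea, not Solr's actual code. It assumes the word list is plain text with one word per line and '#' comment lines, and that tokens are already lowercased by an earlier filter. The names load_protected, toy_stem and stem_unless_protected are hypothetical, and toy_stem is a one-rule stand-in rather than the Porter algorithm.

```python
def load_protected(text):
    """Parse protwords-style text: one word per line, '#' lines are comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def toy_stem(token):
    # Stand-in rule, NOT the Porter algorithm: drop a single trailing "e".
    if len(token) > 3 and token.endswith("e"):
        return token[:-1]
    return token


def stem_unless_protected(token, protected):
    # Protected tokens pass through unchanged; all others are stemmed.
    return token if token in protected else toy_stem(token)


# Assumed file contents, for illustration only.
PROTWORDS = """\
# place names that should keep their surface form
maine
"""

protected = load_protected(PROTWORDS)
print(stem_unless_protected("maine", protected))  # maine
print(stem_unless_protected("maine", set()))      # main
```

The lookup is an exact match on the token, so under these assumptions the entries in the list need to be in the same case as the tokens reaching the stemmer.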

So - I have an entry for my protwords.txt. What else should go in there?

Thanks for your ideas,

Dave Woodward


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Words that need protection from stemming, i.e., protwords.txt

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Words that need protection from stemming, i.e., protwords.txt
: References: <49...@gmail.com>
:  <39...@gmail.com>
:  <49...@stimulussoft.com>
: In-Reply-To: <49...@stimulussoft.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message; instead, start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to, so your question is "hidden" in that thread and gets less
attention.  It also makes following discussions in the mailing list
archives particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking





-Hoss




Re: Words that need protection from stemming, i.e., protwords.txt

Posted by patrick o'leary <pj...@pjaol.com>.
Porter is a little outdated; I've found KStem much better:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem

You'll still need a good protected word list, but KStem is just a little
nicer
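
One way to look for candidates, sketched here under assumptions rather than taken from any Solr tool: group the terms in your vocabulary by their stem and review any stem shared by unrelated words. The function stem_collisions and the sample terms are hypothetical, and toy_stem is a one-rule stand-in that you would swap for whichever real stemmer you use.

```python
from collections import defaultdict


def stem_collisions(terms, stem):
    """Group terms by stem and keep only stems shared by two or more terms."""
    groups = defaultdict(set)
    for term in terms:
        groups[stem(term)].add(term)
    return {s: sorted(ts) for s, ts in groups.items() if len(ts) > 1}


def toy_stem(token):
    # Stand-in rule, NOT Porter or KStem: drop a single trailing "e".
    return token[:-1] if len(token) > 3 and token.endswith("e") else token


# Hypothetical sample vocabulary; in practice this would come from the index.
terms = ["main", "maine", "vermont", "mains"]
for shared, members in stem_collisions(terms, toy_stem).items():
    print(shared, "<-", members)
```

Each group still needs a human look: most collisions are the conflation you want, and only the unrelated ones belong in the protected list.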

On Fri, Jan 16, 2009 at 6:20 PM, David Woodward <dw...@loc.gov> wrote:

> Hi.
>
> Any good protwords.txt out there?
>
> In a fairly standard Solr analyzer chain, we use the English Porter
> stemming filter like so:
>
> <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>
> For most purposes the Porter stemmer does just fine, but occasionally words
> come along that really don't work out too well, e.g.,
>
> "maine" is stemmed to "main" - clearly goofing up precision about "Maine"
> without doing much good for variants of "main".
>
> So - I have an entry for my protwords.txt. What else should go in there?
>
> Thanks for your ideas,
>
> Dave Woodward