Posted to solr-user@lucene.apache.org by Alexander Cougarman <ac...@bwc.org> on 2012/08/07 10:18:29 UTC

Stemming questions

Dear friends,

A few questions on stemming support in Solr 3.6.1:
 - Can you do non-English stemming?
 - We're using solr.PorterStemFilterFactory on the "text_en" field type. We will index a ton of PDF, DOCX, etc. docs in multiple languages. Is this the best filter factory to use for stemming? 
 - For words like "run", "runners", "running", "ran", we need all to be returned. Is there a factory that will return all those? When searching on "run", Porter returned "run", "running", "runners" but not "ran". Not sure if anything could pick that up. 
 - Is it possible to turn off the stemming filter via code, so it could be a checkbox on a web page? We will be writing this in C#. 

Thank you for your help :)

Sincerely,
Alex 


Re: Stemming questions

Posted by Tanguy Moal <ta...@gmail.com>.
Dear Alexander,

> A few questions on stemming support in Solr 3.6.1:
>  - Can you do non-English stemming?
>
With Solr, many languages are supported, see
http://wiki.apache.org/solr/LanguageAnalysis

>  - We're using solr.PorterStemFilterFactory on the "text_en" field type. We
> will index a ton of PDF, DOCX, etc. docs in multiple languages. Is this the
> best filter factory to use for stemming?
>
I think it's hard to answer that question, so maybe someone else will have
a better answer than mine!
My answer to that question would be: the best thing to do is to test the
available alternatives and then make a decision.
There are different implementations depending on the language. For English,
there is the EnglishMinimalStemFilterFactory which does, as it says in the
name, minimal stemming. I think that's essentially about plural/singular
and some other things.

>  - For words like "run", "runners", "running", "ran", we need all to be
> returned. Is there a factory that will return all those? When searching on
> "run", Porter returned "run", "running", "runners" but not "ran". Not sure
> if anything could pick that up.
>
If you read the page linked above, down to
http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming, you'll
see that you can add custom mapping rules for unsupported cases you need to
cover.

>  - Is it possible to turn off the stemming filter via code, so it could be
> a checkbox on a web page? We will be writing this in C#.

Yes it is. In practice you will not be turning stemming on or off. Instead,
you'll have the same content indexed in two distinct fields, say
text_unstemmed and text_en: text_unstemmed will not have the stemmer in its
analysis pipeline and text_en will have it.

Checking the checkbox on the web page would then simply change the query
sent to Solr so that the stemmed field is queried or not ;-)

Practically, you could use dismax queries: checking "[x] activate
stemming" would set the "qf" parameter to "text_unstemmed^2 text_en", and
unchecking "[ ] activate stemming" would set "qf" to "text_unstemmed"
only.

You can test all these using a web browser and Solr's HTTP API before
digging into the C# client to make sure you get what you expected ;-)
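
To make the checkbox idea concrete, here is a minimal sketch in Python (used only for illustration; the same logic would carry over to a C# client). It builds the URL you could paste into a browser. The host, port and "/solr/select" path are assumptions about a default local install, and build_params / build_query_url are hypothetical helper names, not part of any Solr client library. The field names text_unstemmed and text_en are the ones suggested above.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance exposing the default select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_params(user_query, stemming_enabled):
    """Return dismax request parameters; the checkbox only changes qf."""
    if stemming_enabled:
        # Query both fields, boosting exact (unstemmed) matches.
        qf = "text_unstemmed^2 text_en"
    else:
        # Query only the field without a stemmer in its analysis chain.
        qf = "text_unstemmed"
    return {"defType": "dismax", "q": user_query, "qf": qf, "wt": "json"}


def build_query_url(user_query, stemming_enabled):
    """Return a URL that can be opened in a browser to inspect the results."""
    return SOLR_SELECT_URL + "?" + urlencode(build_params(user_query, stemming_enabled))


if __name__ == "__main__":
    print(build_query_url("run", stemming_enabled=True))
    print(build_query_url("run", stemming_enabled=False))
```

Only the qf value differs between the two URLs, so the index itself never has to change when the user toggles the checkbox.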

> Thank you for your help :)

I hope this helps :-)

> Sincerely,
> Alex
>

Best regards,
Tanguy

Re: Stemming questions

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could use a synonym filter to map "ran" to "run".

ran => run (and apply same filter at query and index time)

or

ran, run (only apply filter at index time, synonym filtering not needed at 
query time)

But you would have to manually add all such word forms.
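
To illustrate the difference between the two rule styles, here is a toy model in Python. It is not Solr's synonym filter, only a simplified simulation under stated assumptions: whitespace tokenization, lowercasing, single-word synonyms, and a match defined as "any query token appears among the indexed tokens". The helper names parse_rules, analyze and matches are hypothetical.

```python
def parse_rules(lines):
    """Toy parser for the two rule styles: 'a => b' and 'a, b'."""
    mapping = {}
    for line in lines:
        if "=>" in line:
            # Explicit mapping: each left-hand term is replaced by the right side.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            # Equivalent terms: each term expands to the whole group.
            group = [t.strip() for t in line.split(",")]
            for term in group:
                mapping[term] = group
    return mapping


def analyze(text, mapping=None):
    """Lowercase, split on whitespace, then optionally apply the synonym mapping."""
    tokens = text.lower().split()
    if mapping is None:
        return tokens
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token, [token]))
    return expanded


def matches(indexed_tokens, query_tokens):
    """A document matches if any query token appears among its indexed tokens."""
    return any(token in indexed_tokens for token in query_tokens)


if __name__ == "__main__":
    # Style 1: "ran => run", applied at both index and query time.
    explicit = parse_rules(["ran => run"])
    doc = analyze("She ran home", explicit)
    print(doc, matches(doc, analyze("ran", explicit)))

    # Style 2: "ran, run", applied at index time only; the query is left as typed.
    equivalent = parse_rules(["ran, run"])
    doc = analyze("She ran home", equivalent)
    print(doc, matches(doc, analyze("run")))
```

In the first style both sides are normalized to "run", so the filter has to run on queries too. In the second style the document carries both forms, so the query can be left untouched at the cost of a somewhat larger index.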

-- Jack Krupansky
