Posted to solr-user@lucene.apache.org by "Tao, Jing" <jt...@webmd.net> on 2015/03/30 19:48:16 UTC

protected phrases - possible?

Hi,

The way our collection is set up, searches for "breast cancer" are returning results for ovarian cancer, or anything that contains either "breast" or "cancer".  The reason is that we are searching across multiple fields.  Even though I have set an "mm" value so that if there are fewer than 3 terms, ALL terms must match, SOLR considers it all matched even when "breast" is in the title and "cancer" is in the description.

Is there a way to protect certain phrases so that they will not be tokenized?  I tried using CommonGramsFilterFactory, but having "breast cancer" in the word list did not seem to do anything.  I'm guessing it's because the field is tokenized first, so nothing would match that phrase.  If I put "breast" and "cancer" as separate entries in the word list, I end up with too many unnecessary shingles, and "breast" and "cancer" are still two of the final terms.

I have a feeling CommonGramsFilterFactory is not the right way to handle this.  What are other options?  Is it better to combine all fields into one field, apply mm, and add a proximity boost?

Thanks!
Jing

Re: protected phrases - possible?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Jing,

You can boost phrases with the pf (phrase fields) parameter. If you don't like this solution, you can modify the search query on the client side, e.g. by surrounding certain phrases with quotes. This will force proximity search without interfering with tokenisation.

Ahmet
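
As a rough illustration of both suggestions, here is a minimal client-side sketch in Python. It assumes the edismax query parser and the field names "title" and "description" taken from the question; the phrase list, the boost values, the mm value, and the helper names are all hypothetical and were not given in the thread. It only builds the request parameters and does not contact a Solr server.

```python
import re
from urllib.parse import urlencode

# Hypothetical list of phrases to protect; not from the original thread.
PROTECTED_PHRASES = ["breast cancer", "ovarian cancer"]


def quote_protected_phrases(query, phrases=PROTECTED_PHRASES):
    """Wrap each known phrase in double quotes unless a quote already touches it."""
    for phrase in phrases:
        # Case-insensitive match on word boundaries; the lookarounds skip
        # occurrences that are already directly enclosed in double quotes.
        pattern = re.compile(
            r'(?<!")\b' + re.escape(phrase) + r'\b(?!")', re.IGNORECASE
        )
        query = pattern.sub(lambda m: '"' + m.group(0) + '"', query)
    return query


def build_params(user_query):
    """Build Solr request parameters (assumed edismax setup, assumed fields)."""
    return {
        "defType": "edismax",
        # Quoted phrases are sent as phrase queries, so both words have to
        # occur together in the same field to match.
        "q": quote_protected_phrases(user_query),
        "qf": "title description",        # assumed field names
        # pf only boosts documents where the query terms appear close
        # together in these fields; it does not filter anything out.
        "pf": "title^10 description^5",   # assumed boosts
        # Assumed mm: with 1 or 2 clauses all are required, above that 75%.
        "mm": "2<75%",
    }


if __name__ == "__main__":
    params = build_params("breast cancer treatment")
    print(params["q"])
    print(urlencode(params))
```

The quoting step is what actually keeps "breast" in the title plus "cancer" in the description from counting as a match; pf on its own only changes the ranking.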

