Posted to solr-user@lucene.apache.org by Satish Kumar <sa...@gmail.com> on 2010/08/05 21:28:20 UTC

anti-words - exact match

Hi,

We have a requirement to NOT display search results if user query contains
terms that are in our anti-words field. For example, if user query is "I
have swollen foot" and if some records in our index have "swollen foot" in
anti-words field, we don't want to display those records. How do I go about
implementing this?

NOTE 1: the anti-words field can contain multiple values. Each value can be
one or more words (e.g. "swollen foot", "headache", etc.)

NOTE 2: the match must be exact. If anti-words field contains "swollen foot"
and if user query is "I have swollen foot", record must be excluded. If user
query is "My foot is swollen", the record should not be excluded.

Any pointers are greatly appreciated!


Thanks,
Satish

Re: anti-words - exact match

Posted by Satish Kumar <sa...@gmail.com>.

Thanks Jon.

My initial thought was exactly like yours. My preference was to implement
this requirement completely at the Solr level so that different applications
won't have to duplicate this logic. However, I am not sure how to shingle-ize
the input query and use that in a filter query with a NOT operator at the Solr
layer. The other option, as you suggested, is to shingle-ize the input query in
the application layer -- this is doable, but means adding logic in the
application layer.

For now I am settling on the below solution:

- each anti-word (which can be multiple words) will be stored as a separate
token. The input record will contain the anti-words separated by commas.
solr.PatternTokenizerFactory will be used to split on the comma and create
the tokens

- the list of anti-words is stored in memory in the application layer, and
anti-words are extracted from the user-entered query (e.g. if the user enters
'I have swollen foot' and 'swollen foot' is an anti-word, 'swollen foot' is
extracted)

- a filter query with a NOT operator on the anti-words field is sent to Solr
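
A rough Python sketch of the application-layer part of this plan, for
illustration only. The field name anti_words, the sample list, and the
function names are assumptions, not anything from our actual setup. It also
assumes the anti-words are stored lowercased and single-spaced, so that the
extracted phrase equals the indexed token exactly.

```python
import re

# Hypothetical in-memory anti-word list (lowercase, single-spaced).
ANTI_WORDS = {"swollen foot", "headache"}


def find_anti_words(query, anti_words=ANTI_WORDS):
    """Return the anti-words that appear in query as a contiguous run of whole words."""
    words = re.findall(r"\w+", query.lower())
    # Pad with spaces so that only whole-word matches count.
    padded = " " + " ".join(words) + " "
    return sorted(a for a in anti_words if f" {a} " in padded)


def build_fq(matches, field="anti_words"):
    """Build a negated filter query, or None when there is nothing to exclude."""
    if not matches:
        return None
    quoted = " OR ".join(f'"{m}"' for m in matches)
    return f"-{field}:({quoted})"


print(build_fq(find_anti_words("I have swollen foot")))
print(build_fq(find_anti_words("My foot is swollen")))
```

The first call finds 'swollen foot' and yields a negated fq; the second finds
nothing, so no fq is sent and the record is not excluded.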


Thanks much!

Satish

> This is tricky. You could try doing something with the ShingleFilter (
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory)
> at _query time_ to turn the user's query:
>
> "i have a swollen foot" into:
> "i", "i have", "i have a", "i have a swollen", .... "have", "have a", "have
> a swollen"... etc.
>
> I _think_ you can get the ShingleFilter factory to do that.
>
> But now you only want to exclude if one of those shingles matches the
> ENTIRE "anti-word". So maybe index as non-tokenized, so each of those
> shingles will somehow only match on the complete thing.  You'd want to
> normalize spacing and punctuation.
>
> But then you need to turn that into a _negated_ element of your query.
> Perhaps by using an fq with a NOT/"-" in it? And a query which 'matches'
> (causing 'not' behavior) if _any_ of the shingles match.
>
> I have no idea if it's actually possible to put these things together in
> that way. A non-tokenized field? Which still has its queries shingle-ized
> at query time? And then works as a negated query, matching for negation if
> any of the shingles match?  Not really sure how to put that together in your
> solrconfig.xml and/or application logic if needed. You could try.
>

Yup, I didn't know how to shingle-ize the input query and use that as input
in a filter query.


> Another option would be doing the query-time 'shingling' in your app, and
> then it's a somewhat more normal Solr query. &fq= -"shingle one" -"shingle
> two" -"shingle three" etc.  Or put em in separate fq's depending on how you
> want to use your filter cache. Still searching on a non-tokenized field, and
> still normalizing on white-space and punctuation at both index time and
> (using same normalization logic but in your application logic this time)
> query time.  I think that might work.
>
> So I'm not really sure, but maybe that gives you some ideas.
>
> Jonathan

Re: anti-words - exact match

Posted by Jonathan Rochkind <ro...@jhu.edu>.

This is tricky. You could try doing something with the ShingleFilter 
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory) 
at _query time_ to turn the user's query:

"i have a swollen foot" into:
"i", "i have", "i have a", "i have a swollen", .... "have", "have a", 
"have a swollen"... etc.

I _think_ you can get the ShingleFilter factory to do that.
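
To make the shingle idea concrete, here is a minimal Python sketch of what
word shingling produces for that query. It runs in plain application code and
is not Solr's ShingleFilterFactory; the function name word_shingles and the
max_size parameter are made up for illustration.

```python
def word_shingles(text, max_size=None):
    """Return every contiguous run of words in text, up to max_size words long."""
    words = text.lower().split()
    if max_size is None:
        max_size = len(words)
    shingles = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size > len(words):
                break  # this run would go past the end of the query
            shingles.append(" ".join(words[start:start + size]))
    return shingles


shingles = word_shingles("i have a swollen foot")
for shingle in shingles:
    print(shingle)
```

With no size limit, a five-word query yields 15 shingles, and "swollen foot"
is one of them, which is the piece that has to line up with the anti-word.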

But now you only want to exclude if one of those shingles matches the 
ENTIRE "anti-word". So maybe index as non-tokenized, so each of those 
shingles will somehow only match on the complete thing.  You'd want to 
normalize spacing and punctuation.

But then you need to turn that into a _negated_ element of your query. 
Perhaps by using an fq with a NOT/"-" in it? And a query which 'matches' 
(causing 'not' behavior) if _any_ of the shingles match.

I have no idea if it's actually possible to put these things together in 
that way. A non-tokenized field? Which still has its queries 
shingle-ized at query time? And then works as a negated query, matching 
for negation if any of the shingles match?  Not really sure how to put 
that together in your solrconfig.xml and/or application logic if needed. 
You could try.

Another option would be doing the query-time 'shingling' in your app, 
and then it's a somewhat more normal Solr query. &fq= -"shingle one" 
-"shingle two" -"shingle three" etc.  Or put em in separate fq's 
depending on how you want to use your filter cache. Still searching on a 
non-tokenized field, and still normalizing on white-space and 
punctuation at both index time and (using same normalization logic but 
in your application logic this time) query time.  I think that might work.
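
Something like the following Python sketch is what I have in mind for the
app-side version. Untested against a real Solr: the field name anti_words,
the function names, and the max_words cap are all assumptions for
illustration. It assumes the field is a non-tokenized string field whose
values went through the same normalize() step at index time.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def negated_fq(query, field="anti_words", max_words=3):
    """Build one fq value that negates every shingle of the normalized query."""
    words = normalize(query).split()
    clauses = []
    for start in range(len(words)):
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            phrase = " ".join(words[start:end])
            # normalize() removed quotes and other punctuation,
            # so the phrase needs no extra escaping here.
            clauses.append(f'-{field}:"{phrase}"')
    return " ".join(clauses)


print(negated_fq("I have swollen foot"))
```

The max_words cap keeps the fq from growing too large on long queries; it
should be at least as long as the longest anti-word, or that anti-word can
never match.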

So I'm not really sure, but maybe that gives you some ideas.

Jonathan
