Posted to solr-user@lucene.apache.org by Imran <im...@gmail.com> on 2010/10/29 12:25:04 UTC

Influencing scores on values in multiValue fields

Hi All

We've got an index in which we have a multiValued field per document.

Assume the multiValued field holds the following values in each document:

Doc1:
bar lifters

Doc2:
truck tires
back drops
bar lifters

Doc3:
iron bar lifters

Doc4:
brass bar lifters
iron bar lifters
tire something
truck something
oil gas

Now when we search for 'bar lifters', the expectation (based on the
requirements) is that we get results in the order Doc1, Doc2, Doc4, Doc3:
Doc1 - since there's an exact match (and it is the only value) for the search terms
Doc2 - since there's an exact match amongst the values
Doc4 - since there's a partial match on the values, with more matches
than Doc3
Doc3 - since there's a partial match

However, the results come out as Doc1, Doc3, Doc2, Doc4. Looking at the
explanation of the result, it appears that Doc2 and Doc4 are both losing to
Doc3 because of length normalisation.

We think we can see the reason for that: the field in Doc2 is longer than
the field in Doc3, and the field in Doc4 is longer still.
However, is there any mechanism by which I can force Doc2 and Doc4 to beat
Doc3 with this structure?

We did look at using omitNorms=true, but that messes up the scores for all
docs. The results come out as Doc4, Doc1, Doc2, Doc3 (where Doc1, Doc2 and
Doc3 get the same score).
This is because the fieldNorm is no longer taken into account (as
expected), which leaves the term frequency as the only contributing factor.
So trying to avoid length normalisation through omitNorms is not helping.
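
To make the two orderings above concrete, here is a rough sketch of the
arithmetic. It is not Lucene's actual scoring code: it assumes the classic
tf = sqrt(frequency) and lengthNorm = 1/sqrt(number of terms), treats all
values of a document as one token stream, does no stemming, and ignores idf,
coord and queryNorm (which are the same for every document here). The helper
names score and ranking are made up for this illustration.

```python
import math

# Values of the multiValued field, copied from the post above.
DOCS = {
    "Doc1": ["bar lifters"],
    "Doc2": ["truck tires", "back drops", "bar lifters"],
    "Doc3": ["iron bar lifters"],
    "Doc4": ["brass bar lifters", "iron bar lifters", "tire something",
             "truck something", "oil gas"],
}

def score(values, query, use_norms=True):
    """Toy score: sum of sqrt(term frequency), optionally scaled by 1/sqrt(field length)."""
    tokens = " ".join(values).split()  # all values share one token stream
    total = sum(math.sqrt(tokens.count(term)) for term in query.split())
    norm = 1.0 / math.sqrt(len(tokens)) if use_norms else 1.0
    return total * norm

def ranking(use_norms):
    """Return (document names sorted by descending score, raw scores)."""
    scores = {name: score(vals, "bar lifters", use_norms) for name, vals in DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

for use_norms in (True, False):
    order, scores = ranking(use_norms)
    print("norms on " if use_norms else "norms off", order,
          {name: round(s, 3) for name, s in scores.items()})
```

With norms on, this model puts Doc1 first and Doc3 second, and leaves Doc2
and Doc4 essentially tied behind them. Real Lucene stores the norm with
reduced precision, so the observed order between those two can differ from
this sketch. With norms off, Doc4 leads on term frequency alone and the other
three documents score the same, which is the behaviour described above.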

Is there any way we can make an exact match of a value in a multiValued
field add to the overall score whilst keeping the length normalisation?

Hope that makes sense.

Cheers
-- Imran

Re: Influencing scores on values in multiValue fields

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Be careful of multi-term queries and String types. By multi-term here,
I mean multi-term according to the 'pre-tokenization' that the dismax and
standard parsers do -- basically on whitespace. If you have a string
with whitespace indexed as a single value in a non-tokenized Solr String
field, and you have a q that is that identical string (with whitespace, but
NOT enclosed in phrase quotes) -- it still won't match, because of the
pre-tokenization on whitespace that the query parsers do.

It WILL still match if you put the q in double quotes for a phrase. And 
it WILL still match for a dismax pf phrase boost.  But it will not match 
a dismax qf field, or a standard query parser fielded q search.
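
A toy model of that behaviour, to show why the quotes matter. This is not
Solr's parser, only a sketch of the idea: the query is split on whitespace
unless it is double-quoted, and each resulting chunk must equal a whole
indexed value to match an untokenized field. The names EXACT_VALUES,
pre_tokenize and matches_string_field are invented for this example.

```python
import re

# Values indexed verbatim in a hypothetical untokenized (string) field.
EXACT_VALUES = {"bar lifters", "truck tires"}

def pre_tokenize(q):
    """Split a query on whitespace, keeping double-quoted phrases as one chunk."""
    return [phrase or word for phrase, word in re.findall(r'"([^"]*)"|(\S+)', q)]

def matches_string_field(q):
    """A chunk matches only if it equals a whole indexed value."""
    return any(chunk in EXACT_VALUES for chunk in pre_tokenize(q))

print(pre_tokenize('bar lifters'), matches_string_field('bar lifters'))
print(pre_tokenize('"bar lifters"'), matches_string_field('"bar lifters"'))
```

The unquoted query is split into 'bar' and 'lifters', neither of which is a
complete indexed value, so nothing matches. The quoted query stays in one
piece and matches.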

This means this approach to solving the problem won't always do what you'd
like. I haven't figured out a better one though. With dismax, it seems to
mostly do what you'd want if you include the field both as a boosted field
in qf (which will match on single-term queries, but not on queries with
whitespace) AND as a boosted field in pf (which will match on queries with
whitespace, but won't be used at all for queries without whitespace, as
dismax doesn't even bring the pf into play unless the pre-tokenization
comes up with more than one term). An alternate strategy might be to use
it as a dismax bq query, since you can tell bq to use an alternate query
parser (for example !field or !raw) that won't do the pre-tokenization.
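
As a sketch of what those two variants could look like as request
parameters: the field names tags_text and tags_exact and the boost values
are hypothetical, not taken from anyone's schema in this thread, and the bq
form is untested, so treat it as a starting point only.

```python
from urllib.parse import urlencode

# Hypothetical field names and boost values, chosen only for illustration.
TEXT_FIELD = "tags_text"    # tokenized field used for partial matches
EXACT_FIELD = "tags_exact"  # untokenized string copy used for exact matches

def qf_pf_params(user_query):
    """Exact field in both qf (single-term queries) and pf (queries with whitespace)."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": f"{TEXT_FIELD} {EXACT_FIELD}^10",
        "pf": f"{EXACT_FIELD}^20",
    }

def bq_params(user_query):
    """Alternative: a bq clause handed to the field parser, skipping pre-tokenization."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": TEXT_FIELD,
        "bq": f"{{!field f={EXACT_FIELD}}}{user_query}",
    }

# Print the URL-encoded query strings that would be appended to the select URL.
print(urlencode(qf_pf_params("bar lifters")))
print(urlencode(bq_params("bar lifters")))
```

In the second variant the user's text is placed after the {!field ...} local
params unchanged, so the whole string is looked up as one value in the exact
field.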

Imran wrote:
> Thanks Mike for your suggestion. It did take me down the correct route. I
> basically created another multiValue field of type 'string' and boosted
> that. To get the partial matches to avoid the length normalisation I had the
> 'text' type multiValue field to omitNorms. The results look as per expected
> so far on this configuration.
>
> Cheers
> -- Imran

Re: Influencing scores on values in multiValue fields

Posted by Imran <im...@gmail.com>.
Thanks Mike for your suggestion. It did take me down the correct route. I
basically created another multiValued field of type 'string' and boosted
that. To keep length normalisation from affecting the partial matches, I set
omitNorms on the 'text' type multiValued field. The results look as expected
so far with this configuration.
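
For anyone following along, a rough model of why this works. It is a sketch
under simplifying assumptions, not Solr's scoring: the text field
contributes only sqrt(term frequency) per query term (norms omitted), and an
exact match on the string field adds a flat boost. EXACT_BOOST = 5.0 and the
function names are made up for this illustration.

```python
import math

# Values of the multiValued field, as in the original post.
DOCS = {
    "Doc1": ["bar lifters"],
    "Doc2": ["truck tires", "back drops", "bar lifters"],
    "Doc3": ["iron bar lifters"],
    "Doc4": ["brass bar lifters", "iron bar lifters", "tire something",
             "truck something", "oil gas"],
}

EXACT_BOOST = 5.0  # hypothetical boost for the string field

def text_score(values, query):
    """Text field with norms omitted: only term frequency counts."""
    tokens = " ".join(values).split()
    return sum(math.sqrt(tokens.count(term)) for term in query.split())

def combined_score(values, query, boost=EXACT_BOOST):
    """Add a flat boost when the query equals one whole value of the string field."""
    exact = boost if query in values else 0.0
    return text_score(values, query) + exact

scores = {name: combined_score(vals, "bar lifters") for name, vals in DOCS.items()}
order = sorted(scores, key=scores.get, reverse=True)
print(order, {name: round(s, 3) for name, s in scores.items()})
```

In this model the two exact matches (Doc1 and Doc2) tie at the top, then
Doc4 beats Doc3 on term frequency. Whether Doc1 is separated from Doc2 in a
real index would depend on details not modelled here, such as whether norms
are kept on the string field.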

Cheers
-- Imran

On Fri, Oct 29, 2010 at 1:09 PM, Michael Sokolov <so...@ifactory.com> wrote:

> How about creating another field for doing exact matches (a string);
> searching both and boosting the string match?
>
> -Mike

RE: Influencing scores on values in multiValue fields

Posted by Michael Sokolov <so...@ifactory.com>.
How about creating another field for doing exact matches (a string);
searching both and boosting the string match?

-Mike 
