Posted to solr-user@lucene.apache.org by Brian Lamb <br...@journalexperts.com> on 2011/07/26 23:33:58 UTC

Exact match not the first result returned

Hi all,

I am a little confused as to why the scoring is working the way it is:

I have a field defined as:

<field name="myname" type="text" indexed="true" stored="true"
required="false" multiValued="true" />

And I have several documents where that value is:

RECORD 1
<arr name="myname">
  <str>Fred</str>
  <str>Fred (the coolest guy in town)</str>
</arr>

OR

RECORD 2
<arr name="myname">
  <str>Fred Anderson</str>
</arr>

When I do a search for
http://localhost:8983/solr/search/?q=myname:Fred, RECORD 2 is
returned before RECORD 1.

RECORD 2
5.282213 = (MATCH) fieldWeight(myname:Fred in 256575), product of:
  1.0 = tf(termFreq(myname:Fred)=1)
  8.451541 = idf(docFreq=7306, maxDocs=12586425)
  0.625 = fieldNorm(field=myname, doc=256575)

RECORD 1
4.482106 = (MATCH) fieldWeight(myname:Fred in 215), product of:
  1.4142135 = tf(termFreq(myname:Fred)=2)
  8.451541 = idf(docFreq=7306, maxDocs=12586425)
  0.375 = fieldNorm(field=myname, doc=215)

So the difference is fieldNorm obviously but I think that's only part
of the story. Why is RECORD 2 returned with a higher score than RECORD
1 even though RECORD 1 matches "Fred" exactly? And how should I do
this differently so that I am getting the results I am expecting?
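
For reference, each explain output above is a product of three factors. A minimal Python check of that arithmetic, assuming tf is the square root of termFreq (which matches the 1.4142135 shown for termFreq=2); the function name field_weight is only illustrative:

```python
import math

def field_weight(term_freq, idf, field_norm):
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)
    return math.sqrt(term_freq) * idf * field_norm

IDF = 8.451541  # idf(docFreq=7306, maxDocs=12586425) from the explain output

record2 = field_weight(1, IDF, 0.625)  # "Fred Anderson"
record1 = field_weight(2, IDF, 0.375)  # "Fred" + "Fred (the coolest guy in town)"

print(f"RECORD 2: {record2:.6f}")
print(f"RECORD 1: {record1:.6f}")
```

The higher tf of RECORD 1 is more than cancelled by its lower fieldNorm, which is why RECORD 2 comes first.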

Thanks,

Brian Lamb

Re: Exact match not the first result returned

Posted by Jonathan Rochkind <ro...@jhu.edu>.

Keep in mind that if you use a field type that keeps spaces inside a
single token (eg StrField, or a TextField using KeywordTokenizer), then
with the dismax or lucene query parsers, the only way to find matches in
this field on queries that include spaces will be to do explicit phrase
searches with double quotes.

These fields will, however, work fine with "pf" in dismax/edismax as per 
Hoss's example.

But yeah, I do what Hoss recommends -- I've got a KeywordTokenizer copy 
of my searchable field. I use a pf on that field with a very high boost 
to try and boost truly "complete" matches, that match the entirety of 
the value.  It's not exactly 'exact', I still do some normalization,
including flattening unicode to ascii, and collapsing runs of one or more
space-or-punctuation characters to exactly one space using a char regex filter.
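
A rough sketch of that kind of normalization, written in Python rather than as a Solr analyzer chain. The function name, the exact character class, and the final strip are assumptions made for illustration, not the actual filter configuration described above:

```python
import re
import unicodedata

def normalize_value(text):
    # Decompose accented characters, then drop anything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse each run of whitespace/punctuation into a single space.
    collapsed = re.sub(r"[^A-Za-z0-9]+", " ", ascii_only)
    # Assumption: leading/trailing spaces are not significant.
    return collapsed.strip()

print(normalize_value("Fr\u00e9d\u00e9ric   Anderson, Jr."))
print(normalize_value("Fred (the coolest guy in town)"))
```

The same function would have to be applied to both the indexed value and the query text for "complete" matches to line up.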

It seems to pretty much work -- this is just one of various relevancy 
tweaks I've got going on, to the extent that my relevancy has become 
pretty complicated and hard to predict and doesn't always do what I'd 
expect/intend, but this particular aspect seems to mostly pretty much work.

On 7/27/2011 10:55 PM, Chris Hostetter wrote:
> : With your solution, RECORD 1 does appear at the top but I think that's just
> : blind luck more than anything else because RECORD 3 shows as having the same
> : score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd
> : like all three records returned with RECORD 1 being the first listing.
>
> with omitNorms RECORD1 and RECORD3 have the same score because only the
> tf() matters, and both docs contain the term "fred" exactly twice.
>
> the reason RECORD1 isn't scoring higher even though it contains (as you
> put it) a value matching "Fred" exactly is that from a term perspective,
> RECORD1 doesn't actually match "myname:Fred" exactly, because there are in
> fact other terms in that field because it's multivalued.
>
> one way to indicate that you *only* want documents where entire field
> values match your input (ie: RECORD1 but no other records) would be to
> use a StrField instead of a TextField, or an analyzer that doesn't split up
> tokens (ie: something using KeywordTokenizer).  that way a query on
> myname:Frank would not match a document where you had indexed the value
> "Frank Stalone" but a query for myname:"Frank Stalone" would.
>
> in your case, you don't want *only* the exact field value matches, but you
> want them boosted, so you could do something like copyField "myname" into
> "myname_str" and then do...
>
>    q=+myname:Frank myname_str:"Frank"^100
>
> ...in which case a match on "myname" is required, but a match on
> "myname_str" will greatly increase the score.
>
> dismax (and edismax) are really designed for situations like this...
>
>    defType=dismax & qf=myname & pf=myname_str^100 & q=Frank
>
>
>
> -Hoss
>

Re: Exact match not the first result returned

Posted by Brian Lamb <br...@journalexperts.com>.

That's a clever idea. I'll put something together and see how it turns out.
Thanks for the tip.


Re: Exact match not the first result returned

Posted by Brian Lamb <br...@journalexperts.com>.

I implemented both solutions Hoss suggested and was able to achieve the
desired results. I would like to go with

 defType=dismax & qf=myname & pf=myname_str^100 & q=Frank

but that doesn't seem to work if I have a query like myname:Frank
otherfield:something. So I think I will go with

q=+myname:Frank myname_str:"Frank"^100

Thanks for the help everyone!

Brian Lamb


Re: Exact match not the first result returned

Posted by Chris Hostetter <ho...@fucit.org>.

: With your solution, RECORD 1 does appear at the top but I think that's just
: blind luck more than anything else because RECORD 3 shows as having the same
: score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd
: like all three records returned with RECORD 1 being the first listing.

with omitNorms RECORD1 and RECORD3 have the same score because only the
tf() matters, and both docs contain the term "fred" exactly twice.
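
A small Python sketch of why the tie happens, assuming that with omitNorms the fieldNorm factor is effectively 1.0 and that tf is sqrt(termFreq) as in the explain output earlier in the thread. The term counts are read off the three example records; anything else (such as queryNorm) is ignored here:

```python
import math

IDF = 8.451541  # same idf for every document, taken from the explain output

# Occurrences of "fred" in the merged myname field of each record.
term_freqs = {"RECORD 1": 2, "RECORD 2": 1, "RECORD 3": 2}

# With norms omitted, the score reduces to tf * idf.
scores = {name: math.sqrt(tf) * IDF for name, tf in term_freqs.items()}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.6f}")
```

RECORD 1 and RECORD 3 come out identical, so their relative order is arbitrary unless something else breaks the tie.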

the reason RECORD1 isn't scoring higher even though it contains (as you
put it) a value matching "Fred" exactly is that from a term perspective,
RECORD1 doesn't actually match "myname:Fred" exactly, because there are in
fact other terms in that field because it's multivalued.

one way to indicate that you *only* want documents where entire field
values match your input (ie: RECORD1 but no other records) would be to
use a StrField instead of a TextField, or an analyzer that doesn't split up
tokens (ie: something using KeywordTokenizer).  that way a query on
myname:Frank would not match a document where you had indexed the value
"Frank Stalone" but a query for myname:"Frank Stalone" would.

in your case, you don't want *only* the exact field value matches, but you 
want them boosted, so you could do something like copyField "myname" into 
"myname_str" and then do...

  q=+myname:Frank myname_str:"Frank"^100

...in which case a match on "myname" is required, but a match on 
"myname_str" will greatly increase the score.
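
When that query is sent over HTTP, the leading "+" and the quotes need URL encoding, otherwise the "+" is decoded as a space. A minimal Python sketch; the base URL is the one from the original question, the helper name boosted_exact_query is hypothetical, and the term is assumed to contain no characters that need query-syntax escaping:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def boosted_exact_query(term, text_field="myname", exact_field="myname_str", boost=100):
    # Required match on the tokenized field, optional boosted match on the copy.
    return f'+{text_field}:{term} {exact_field}:"{term}"^{boost}'

base_url = "http://localhost:8983/solr/search/"  # from the original question
q = boosted_exact_query("Frank")
url = base_url + "?" + urlencode({"q": q})

print(q)
print(url)
```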

dismax (and edismax) are really designed for situations like this...

  defType=dismax & qf=myname & pf=myname_str^100 & q=Frank
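
The same request parameters assembled in Python, as a sketch only. The parameter names (defType, qf, pf, q) are the ones shown above; the helper name dismax_params is hypothetical:

```python
from urllib.parse import urlencode

def dismax_params(user_input, qf="myname", pf="myname_str^100"):
    # qf: fields searched term by term; pf: fields where the whole input,
    # matched as a phrase, earns an extra boost.
    return {"defType": "dismax", "qf": qf, "pf": pf, "q": user_input}

query_string = urlencode(dismax_params("Frank"))
print(query_string)
```

Here the user input stays a plain string, and the boosting logic lives entirely in qf/pf.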



-Hoss

Re: Exact match not the first result returned

Posted by Brian Lamb <br...@journalexperts.com>.

Thanks Emmanuel for that explanation. I implemented your solution but I'm
not quite there yet. Suppose I also have a record:

RECORD 3
<arr name="myname">
  <str>Fred G. Anderson</str>
  <str>Fred Anderson</str>
</arr>

With your solution, RECORD 1 does appear at the top but I think that's just
blind luck more than anything else because RECORD 3 shows as having the same
score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd
like all three records returned with RECORD 1 being the first listing.

Thanks,

Brian Lamb

On Tue, Jul 26, 2011 at 6:03 PM, Emmanuel Espina
<es...@gmail.com> wrote:

> That is caused by the size of the documents. The principle is pretty
> intuitive: if one of your documents is the entire three volumes of The Lord
> of the Rings and you search for "tree", I know that The Lord of the Rings
> will be in the results, and I haven't memorized the entire text of that
> book :p
> It is a matter of probability: in a big (big!) text, any word has a greater
> chance of being found than in a short letter. So one can infer that a match
> in the letter is more relevant than a match in the big text. That is the
> principle applied here, and Lucene applies it when building the ranking.
> The first document is bigger than the second one (remember that all the
> values of a multivalued field are merged into one field in the index, so
> you cannot tell one value apart from another). In the first one you have
> [Fred, coolest, guy, town] and in the second [Fred, Anderson], so the
> second document is more relevant than the first one.
>
> To avoid all this you can set omitNorms to true, and that should
> make the first document more relevant because Fred appears twice (not
> because Fred appears alone in a value).
>
> Regards
> Emmanuel

Re: Exact match not the first result returned

Posted by Emmanuel Espina <es...@gmail.com>.

That is caused by the size of the documents. The principle is pretty
intuitive: if one of your documents is the entire three volumes of The Lord
of the Rings and you search for "tree", I know that The Lord of the Rings
will be in the results, and I haven't memorized the entire text of that book
:p
It is a matter of probability: in a big (big!) text, any word has a greater
chance of being found than in a short letter. So one can infer that a match
in the letter is more relevant than a match in the big text. That is the
principle applied here, and Lucene applies it when building the ranking.
The first document is bigger than the second one (remember that all the
values of a multivalued field are merged into one field in the index, so you
cannot tell one value apart from another). In the first one you have
[Fred, coolest, guy, town] and in the second [Fred, Anderson], so the second
document is more relevant than the first one.
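
A Python sketch of how the two fieldNorm values in the explain output could arise. It assumes the classic Lucene length norm of 1/sqrt(numTerms), stored lossily in one byte so that only two mantissa bits survive (rounding down). The token counts are also assumptions: 2 for RECORD 2, and 7 for RECORD 1 if no stop words are removed. This is an approximation for illustration, not Lucene's actual code:

```python
import math

def length_norm(num_terms):
    # Longer fields get a smaller norm.
    return 1.0 / math.sqrt(num_terms)

def quantize_norm(value):
    # Keep the power of two and truncate the mantissa to steps of 1/8
    # within [0.5, 1), i.e. 0.5, 0.625, 0.75 or 0.875 times a power of two.
    mantissa, exponent = math.frexp(value)
    truncated = math.floor(mantissa * 8) / 8
    return math.ldexp(truncated, exponent)

for label, num_terms in (("RECORD 2", 2), ("RECORD 1", 7)):
    raw = length_norm(num_terms)
    print(f"{label}: raw={raw:.4f} stored={quantize_norm(raw)}")
```

Under those assumptions the stored values come out as 0.625 and 0.375, matching the fieldNorm lines in the question.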

To avoid all this you can set omitNorms to true, and that should
make the first document more relevant because Fred appears twice (not
because Fred appears alone in a value).

Regards
Emmanuel
