Posted to solr-user@lucene.apache.org by al...@aim.com on 2012/04/10 20:46:42 UTC

term frequency outweighs exact phrase match

Hello,

I use Solr 3.5 with edismax and have the following issue with phrase search. For example, suppose I have three documents with content like

1. apache apache
2. solr solr
3. apache solr

A search for apache solr then displays the documents in the order 1, 2, 3 instead of 3, 2, 1, because the term frequency in the first and second documents is higher than in the third. We want the results displayed in the order 3, 2, 1, since the third document has the exact match.

My request handler is as follows.

<requestHandler name="search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">host^30  content^0.5 title^1.2</str>
<str name="pf">host^30  content^20 title^22 </str>
<str name="fl">url,id, site ,title</str>
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<int name="ps">1</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
<str name="hl.fl">content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="hl.fragsize">165</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
<str name="spellcheck">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
<str name="group">true</str>
<str name="group.field">site</str>
<str name="group.ngroups">true</str>
</lst>
<arr name="last-components">
 <str>spellcheck</str>
</arr>
</requestHandler>

Any ideas how to fix this issue?

Thanks in advance.
Alex.

Re: term frequency outweighs exact phrase match

Posted by al...@aim.com.
Hello Hoss,

Here are the explain entries for two docs:

<str name="a0127d8e70a6d523">
0.021646015 = (MATCH) sum of:
  0.021646015 = (MATCH) sum of:
    0.02141003 = (MATCH) max plus 0.01 times others of:
      2.84194E-4 = (MATCH) weight(content:apache^0.5 in 3578), product of:
        0.0029881175 = queryWeight(content:apache^0.5), product of:
          0.5 = boost
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.09510804 = (MATCH) fieldWeight(content:apache in 3578), product of:
          2.236068 = tf(termFreq(content:apache)=5)
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.009765625 = fieldNorm(field=content, doc=3578)
      0.021407187 = (MATCH) weight(title:apache^1.2 in 3578), product of:
        0.01371095 = queryWeight(title:apache^1.2), product of:
          1.2 = boost
          8.327043 = idf(docFreq=2375, maxDocs=3613605)
          0.0013721307 = queryNorm
        1.5613205 = (MATCH) fieldWeight(title:apache in 3578), product of:
          1.0 = tf(termFreq(title:apache)=1)
          8.327043 = idf(docFreq=2375, maxDocs=3613605)
          0.1875 = fieldNorm(field=title, doc=3578)
    2.359865E-4 = (MATCH) max plus 0.01 times others of:
      2.359865E-4 = (MATCH) weight(content:solr^0.5 in 3578), product of:
        0.004071705 = queryWeight(content:solr^0.5), product of:
          0.5 = boost
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.05795766 = (MATCH) fieldWeight(content:solr in 3578), product of:
          1.0 = tf(termFreq(content:solr)=1)
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.009765625 = fieldNorm(field=content, doc=3578)
</str><str name="d89380e313c64aa5">
0.021465056 = (MATCH) sum of:
  1.8154096E-4 = (MATCH) sum of:
    6.354771E-5 = (MATCH) max plus 0.01 times others of:
      6.354771E-5 = (MATCH) weight(content:apache^0.5 in 638040), product of:
        0.0029881175 = queryWeight(content:apache^0.5), product of:
          0.5 = boost
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.021266805 = (MATCH) fieldWeight(content:apache in 638040), product of:
          1.0 = tf(termFreq(content:apache)=1)
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.0048828125 = fieldNorm(field=content, doc=638040)
    1.1799325E-4 = (MATCH) max plus 0.01 times others of:
      1.1799325E-4 = (MATCH) weight(content:solr^0.5 in 638040), product of:
        0.004071705 = queryWeight(content:solr^0.5), product of:
          0.5 = boost
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.02897883 = (MATCH) fieldWeight(content:solr in 638040), product of:
          1.0 = tf(termFreq(content:solr)=1)
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.0048828125 = fieldNorm(field=content, doc=638040)
  0.021283515 = (MATCH) weight(content:"apache solr"~1^30.0 in 638040), product of:
    0.42358932 = queryWeight(content:"apache solr"~1^30.0), product of:
      30.0 = boost
      10.290306 = idf(content: apache=126092 solr=25986)
      0.0013721307 = queryNorm
    0.050245635 = fieldWeight(content:"apache solr" in 638040), product of:
      1.0 = tf(phraseFreq=1.0)
      10.290306 = idf(content: apache=126092 solr=25986)
      0.0048828125 = fieldNorm(field=content, doc=638040)
</str>

Although the second doc has an exact match, it is placed after the first one, which does not have an exact match.
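
The arithmetic behind the two explain entries can be re-derived as a sanity check. The sketch below is a minimal reconstruction, assuming the classic Lucene TF-IDF formulas that the explain output itself displays (tf is the square root of the frequency, queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm). The function names are hypothetical, and every number is copied from the output above.

```python
import math

# Values copied from the explain output above.
QUERY_NORM = 0.0013721307
TIE = 0.01  # the "tie" parameter of the request handler


def clause_score(boost, idf, freq, field_norm):
    """Score of one term or phrase clause: queryWeight * fieldWeight."""
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight


def dismax(scores, tie=TIE):
    """'max plus tie times others' over the per-field scores of one term."""
    best = max(scores)
    return best + tie * (sum(scores) - best)


# Document a0127d8e70a6d523: no phrase match, but "apache" occurs in the title.
doc_a = (
    dismax([
        clause_score(0.5, 4.3554416, 5, 0.009765625),  # content:apache
        clause_score(1.2, 8.327043, 1, 0.1875),        # title:apache
    ])
    + dismax([clause_score(0.5, 5.9348645, 1, 0.009765625)])  # content:solr
)

# Document d89380e313c64aa5: phrase match in content, no title match.
doc_b = (
    dismax([clause_score(0.5, 4.3554416, 1, 0.0048828125)])    # content:apache
    + dismax([clause_score(0.5, 5.9348645, 1, 0.0048828125)])  # content:solr
    + clause_score(30.0, 10.290306, 1.0, 0.0048828125)         # "apache solr"~1^30
)

print("doc a0127d8e70a6d523:", doc_a)
print("doc d89380e313c64aa5:", doc_b)
```

With these inputs the first document's total is dominated by its title:apache clause (about 0.0214), which is roughly the same size as the whole content phrase boost in the second document, so the phrase match is narrowly outscored rather than ignored.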

I use the following request handler

<requestHandler name="search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">host^30  content^0.5 title^1.2 anchor^1.2</str>
<str name="pf">content^30</str>
<str name="fl">url,id, site ,title</str>
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<int name="ps">1</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
<str name="hl.fl">content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="hl.fragsize">165</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
<str name="spellcheck">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
<str name="group">true</str>
<str name="group.field">site</str>
<str name="group.ngroups">true</str>
</lst>
<arr name="last-components">
 <str>spellcheck</str>
</arr>
</requestHandler>


and the query is as follows 

http://localhost:8983/solr/select/?q=apache solr&version=2.2&start=0&rows=10&indent=on&qt=search&debugQuery=true
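
The URL above carries a raw space in q. Browsers usually encode it silently, but scripts may not. A minimal sketch of building the same request with standard-library URL encoding, assuming only the parameters shown above:

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select/"

# Same parameters as the query above; urlencode escapes the space in q.
params = {
    "q": "apache solr",
    "version": "2.2",
    "start": 0,
    "rows": 10,
    "indent": "on",
    "qt": "search",
    "debugQuery": "true",
}

url = BASE_URL + "?" + urlencode(params)
print(url)
```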

Thanks.
Alex.


-----Original Message-----
From: Chris Hostetter <ho...@fucit.org>
To: solr-user <so...@lucene.apache.org>
Sent: Thu, Apr 12, 2012 7:43 pm
Subject: Re: term frequency outweighs exact phrase match



: I use solr 3.5 with edismax. I have the following issue with phrase 
: search. For example if I have three documents with content like
: 
: 1.apache apache
: 2. solr solr
: 3.apache solr
: 
: then search for apache solr displays documents in the order 1,.2,3 
: instead of 3, 2, 1 because term frequency in the first and second 
: documents is higher than in the third document. We want results be 
: displayed in the order as 3,2,1 since the third document has exact 
: match.

you need to give us a lot more info, like what other data is in the 
various fields for those documents, exactly what your query URL looks 
like, and what debugQuery=true gives you back in terms of score 
explanations for each document, because if that sample content is the only 
thing you've got indexed (even if it's in multiple fields), then documents 
#1 and #2 shouldn't even match your query using the mm you've specified...

: <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

...because doc #1 and #2 will only contain one clause.

Otherwise it should work fine.
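
To see why the quoted mm value requires both terms of a two-word query, the expression can be evaluated by hand. The sketch below is an illustrative re-implementation of the min-should-match rules as they are commonly documented, not Solr's own code; the function name is hypothetical. In solrconfig.xml the value is XML-escaped (&lt;); the raw spec is 2<-1 5<-2 6<90%.

```python
def min_should_match(clause_count, spec):
    """Number of optional clauses that must match for a given mm spec."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional specs: "N<X" means "if there are more than N clauses, use X".
        for part in spec.split():
            upper, inner = part.split("<", 1)
            if clause_count <= int(upper):
                return result
            result = min_should_match(clause_count, inner)
        return result
    if spec.endswith("%"):
        calc = clause_count * int(spec[:-1]) / 100.0
        # Negative percentages mean "all but that share"; int() truncates.
        result = clause_count + int(calc) if calc < 0 else int(calc)
    else:
        calc = int(spec)
        # Negative integers mean "all but that many".
        result = clause_count + calc if calc < 0 else calc
    return max(0, min(clause_count, result))


MM = "2<-1 5<-2 6<90%"
for n in (1, 2, 3, 6, 10):
    print(n, "clauses ->", min_should_match(n, MM), "required")
```

For the two-clause query apache solr this gives 2, so a document containing only one of the two terms is not a match, whereas mm=1 requires a single clause.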

I used the example 3.5 schema, and created 3 docs matching what you 
described. (with name copyfield'ed into text)...

<add>
<doc><field name="id">1</field><field name="name">apache apache</field></doc>
<doc><field name="id">2</field><field name="name">solr solr</field></doc>
<doc><field name="id">3</field><field name="name">apache solr</field></doc>
</add>

...and then used this similar query (note mm=1) to get the results you 
would expect...

http://localhost:8983/solr/select/?fl=name,score&debugQuery=true&defType=edismax&qf=name+text&pf=name^10+text^5&q=apache%20solr&mm=1

<result name="response" numFound="3" start="0" maxScore="1.309231">
<doc>
<float name="score">1.309231</float>
<str name="name">apache solr</str>
</doc>
<doc>
<float name="score">0.022042051</float>
<str name="name">apache apache</str>
</doc>
<doc>
<float name="score">0.022042051</float>
<str name="name">solr solr</str>
</doc>
</result>


-Hoss

 

Re: term frequency outweighs exact phrase match

Posted by al...@aim.com.
In that case documents 1 and 2 will not be in the results. We need them to be shown in the results as well, but ranked after the docs with an exact match.
I think omitting term frequency when ranking phrase queries would solve this issue, but I do not see such a parameter in the configs.
I see omitTermFreqAndPositions="true", but I am not sure it is the setting I need, because its description is too vague.
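
Whether dropping term frequency would change the order can be checked against the explain output posted in this thread for documents a0127d8e70a6d523 and d89380e313c64aa5. A rough sketch, assuming only those numbers and the formulas the explain output prints; the helper name is hypothetical:

```python
import math

# Values copied from the explain output for doc a0127d8e70a6d523.
TIE = 0.01
QUERY_WEIGHT_CONTENT_APACHE = 0.0029881175
IDF_CONTENT_APACHE = 4.3554416
FIELD_NORM_CONTENT = 0.009765625
TITLE_APACHE_SCORE = 0.021407187
CONTENT_SOLR_SCORE = 2.359865e-4

# Total score of the phrase-matching doc d89380e313c64aa5.
DOC_B_TOTAL = 0.021465056


def doc_a_score(tf_value):
    """Recompute doc A's total with a chosen tf for content:apache."""
    content_apache = QUERY_WEIGHT_CONTENT_APACHE * (
        tf_value * IDF_CONTENT_APACHE * FIELD_NORM_CONTENT
    )
    best = max(content_apache, TITLE_APACHE_SCORE)
    others = content_apache + TITLE_APACHE_SCORE - best
    return best + TIE * others + CONTENT_SOLR_SCORE


with_tf = doc_a_score(math.sqrt(5))  # as scored: termFreq=5
flat_tf = doc_a_score(1.0)           # as if term frequency were ignored

print("with tf:", with_tf)
print("flat tf:", flat_tf)
print("still above phrase doc:", flat_tf > DOC_B_TOTAL)
```

For this pair the flattened score stays above the phrase-matching document's total, because the gap comes from the title:apache clause rather than from tf in content. Separately, phrase queries rely on term positions, so a setting that also omits positions on a field would prevent pf matching on that field.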

Thanks.
Alex.


 

 

 

-----Original Message-----
From: Erick Erickson <er...@gmail.com>
To: solr-user <so...@lucene.apache.org>
Sent: Wed, Apr 11, 2012 8:23 am
Subject: Re: term frequency outweighs exact phrase match


Consider boosting on the phrase with a SHOULD clause, something
like field:"apache solr"^2.
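
One possible reading of this suggestion is to add the boosted phrase as an optional clause next to the user's query. The sketch below assumes the clause is passed through the dismax/edismax bq (boost query) parameter; the content field, the boost of 2, and the helper name are illustrative assumptions, not part of the suggestion.

```python
from urllib.parse import urlencode


def phrase_boost_clause(field, phrase, boost):
    """Build an optional phrase clause such as content:"apache solr"^2."""
    # Escape backslashes and double quotes so the phrase stays one quoted unit.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"^{boost}'


user_query = "apache solr"
params = {
    "q": user_query,
    "defType": "edismax",
    # Assumption: the boost is attached via bq, so it only affects ranking.
    "bq": phrase_boost_clause("content", user_query, 2),
}

print(params["bq"])
print(urlencode(params))
```

Because a boost query only adds to the score of documents that already match, documents without the phrase would still be returned, just ranked lower when the boost is large enough.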

Best
Erick


 
