Posted to solr-user@lucene.apache.org by al...@aim.com on 2012/04/10 20:46:42 UTC
term frequency outweighs exact phrase match
Hello,
I use Solr 3.5 with edismax and have the following issue with phrase search. For example, if I have three documents with content like
1. apache apache
2. solr solr
3. apache solr
then a search for apache solr displays the documents in the order 1, 2, 3 instead of 3, 2, 1, because the term frequency in the first and second documents is higher than in the third document. We want the results to be displayed in the order 3, 2, 1, since the third document has an exact match.
My request handler is as follows.
<requestHandler name="search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">host^30 content^0.5 title^1.2</str>
    <str name="pf">host^30 content^20 title^22 </str>
    <str name="fl">url,id, site ,title</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">1</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="hl.fragsize">165</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.count">5</str>
    <str name="group">true</str>
    <str name="group.field">site</str>
    <str name="group.ngroups">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
Any ideas how to fix this issue?
Thanks in advance.
Alex.
Re: term frequency outweighs exact phrase match
Posted by al...@aim.com.
Hello Hoss,
Here is the explain output for two docs:
<str name="a0127d8e70a6d523">
0.021646015 = (MATCH) sum of:
  0.021646015 = (MATCH) sum of:
    0.02141003 = (MATCH) max plus 0.01 times others of:
      2.84194E-4 = (MATCH) weight(content:apache^0.5 in 3578), product of:
        0.0029881175 = queryWeight(content:apache^0.5), product of:
          0.5 = boost
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.09510804 = (MATCH) fieldWeight(content:apache in 3578), product of:
          2.236068 = tf(termFreq(content:apache)=5)
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.009765625 = fieldNorm(field=content, doc=3578)
      0.021407187 = (MATCH) weight(title:apache^1.2 in 3578), product of:
        0.01371095 = queryWeight(title:apache^1.2), product of:
          1.2 = boost
          8.327043 = idf(docFreq=2375, maxDocs=3613605)
          0.0013721307 = queryNorm
        1.5613205 = (MATCH) fieldWeight(title:apache in 3578), product of:
          1.0 = tf(termFreq(title:apache)=1)
          8.327043 = idf(docFreq=2375, maxDocs=3613605)
          0.1875 = fieldNorm(field=title, doc=3578)
    2.359865E-4 = (MATCH) max plus 0.01 times others of:
      2.359865E-4 = (MATCH) weight(content:solr^0.5 in 3578), product of:
        0.004071705 = queryWeight(content:solr^0.5), product of:
          0.5 = boost
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.05795766 = (MATCH) fieldWeight(content:solr in 3578), product of:
          1.0 = tf(termFreq(content:solr)=1)
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.009765625 = fieldNorm(field=content, doc=3578)
</str>
<str name="d89380e313c64aa5">
0.021465056 = (MATCH) sum of:
  1.8154096E-4 = (MATCH) sum of:
    6.354771E-5 = (MATCH) max plus 0.01 times others of:
      6.354771E-5 = (MATCH) weight(content:apache^0.5 in 638040), product of:
        0.0029881175 = queryWeight(content:apache^0.5), product of:
          0.5 = boost
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.021266805 = (MATCH) fieldWeight(content:apache in 638040), product of:
          1.0 = tf(termFreq(content:apache)=1)
          4.3554416 = idf(docFreq=126092, maxDocs=3613605)
          0.0048828125 = fieldNorm(field=content, doc=638040)
    1.1799325E-4 = (MATCH) max plus 0.01 times others of:
      1.1799325E-4 = (MATCH) weight(content:solr^0.5 in 638040), product of:
        0.004071705 = queryWeight(content:solr^0.5), product of:
          0.5 = boost
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.0013721307 = queryNorm
        0.02897883 = (MATCH) fieldWeight(content:solr in 638040), product of:
          1.0 = tf(termFreq(content:solr)=1)
          5.9348645 = idf(docFreq=25986, maxDocs=3613605)
          0.0048828125 = fieldNorm(field=content, doc=638040)
  0.021283515 = (MATCH) weight(content:"apache solr"~1^30.0 in 638040), product of:
    0.42358932 = queryWeight(content:"apache solr"~1^30.0), product of:
      30.0 = boost
      10.290306 = idf(content: apache=126092 solr=25986)
      0.0013721307 = queryNorm
    0.050245635 = fieldWeight(content:"apache solr" in 638040), product of:
      1.0 = tf(phraseFreq=1.0)
      10.290306 = idf(content: apache=126092 solr=25986)
      0.0048828125 = fieldNorm(field=content, doc=638040)
</str>
Although the second doc has an exact match, it is ranked after the first one, which does not have an exact match.
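For reference, both totals can be re-derived from the factors listed in the explain output. The following is a minimal Python sketch, not Solr code: the helper names `term_score` and `dismax` are made up for illustration, and every number is copied from the explain output for the two docs. It suggests that the single title:apache match in the first doc (high idf, fieldNorm of 0.1875) is worth slightly more than the boosted phrase match on content in the second doc, while the term frequency of 5 only affects a clause worth about 2.8E-4.

```python
QUERY_NORM = 0.0013721307  # queryNorm, identical for every clause in the explain output
TIE = 0.01                 # the "tie" value from the request handler


def term_score(boost, idf, tf, field_norm, query_norm=QUERY_NORM):
    """Hypothetical helper: weight = queryWeight * fieldWeight, as the explain output lists it."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


def dismax(scores, tie=TIE):
    """Hypothetical helper: the 'max plus tie times others' combination."""
    best = max(scores)
    return best + tie * (sum(scores) - best)


# Doc a0127d8e70a6d523: no phrase match, but "apache" also matches in title.
content_apache = term_score(0.5, 4.3554416, 2.236068, 0.009765625)
title_apache = term_score(1.2, 8.327043, 1.0, 0.1875)
content_solr = term_score(0.5, 5.9348645, 1.0, 0.009765625)
doc_a = dismax([content_apache, title_apache]) + dismax([content_solr])

# Doc d89380e313c64aa5: both terms in content only, plus the pf phrase clause.
b_content_apache = term_score(0.5, 4.3554416, 1.0, 0.0048828125)
b_content_solr = term_score(0.5, 5.9348645, 1.0, 0.0048828125)
phrase = term_score(30.0, 10.290306, 1.0, 0.0048828125)
doc_b = dismax([b_content_apache]) + dismax([b_content_solr]) + phrase

print(f"doc a total:        {doc_a:.9f}")
print(f"doc b total:        {doc_b:.9f}")
print(f"title:apache alone: {title_apache:.9f}")
print(f"phrase clause:      {phrase:.9f}")
```

Under this reading, the difference between the two docs comes from the title match and the field norms rather than from the repeated term in content.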
I use the following request handler
<requestHandler name="search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">host^30 content^0.5 title^1.2 anchor^1.2</str>
    <str name="pf">content^30</str>
    <str name="fl">url,id, site ,title</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">1</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="hl.fragsize">165</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.count">5</str>
    <str name="group">true</str>
    <str name="group.field">site</str>
    <str name="group.ngroups">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
and the query is as follows
http://localhost:8983/solr/select/?q=apache solr&version=2.2&start=0&rows=10&indent=on&qt=search&debugQuery=true
Thanks.
Alex.
-----Original Message-----
From: Chris Hostetter <ho...@fucit.org>
To: solr-user <so...@lucene.apache.org>
Sent: Thu, Apr 12, 2012 7:43 pm
Subject: Re: term frequency outweighs exact phrase match
Re: term frequency outweighs exact phrase match
Posted by Chris Hostetter <ho...@fucit.org>.
: I use solr 3.5 with edismax. I have the following issue with phrase
: search. For example if I have three documents with content like
:
: 1.apache apache
: 2. solr solr
: 3.apache solr
:
: then search for apache solr displays documents in the order 1,.2,3
: instead of 3, 2, 1 because term frequency in the first and second
: documents is higher than in the third document. We want results be
: displayed in the order as 3,2,1 since the third document has exact
: match.
you need to give us a lot more info, like what other data is in the
various fields for those documents, exactly what your query URL looks
like, and what debugQuery=true gives you back in terms of score
explanations for each document, because if that sample content is the only
thing you've got indexed (even if it's in multiple fields), then documents
#1 and #2 shouldn't even match your query using the mm you've specified...
: <str name="mm">2<-1 5<-2 6<90%</str>
...because doc #1 and #2 will only contain one clause.
Otherwise it should work fine.
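To make the mm point concrete, here is a small Python sketch of how a conditional mm spec such as 2<-1 5<-2 6<90% is usually read. It is an illustrative reimplementation of the documented "min should match" rules, not Solr's own code, and the function name `min_should_match` is hypothetical.

```python
def min_should_match(clause_count, spec):
    """Sketch: number of optional clauses that must match for a given mm spec."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "N<X" means "if there are more than N clauses, apply X".
        for part in spec.split():
            bound, sub_spec = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, sub_spec)
        return result
    if spec.endswith("%"):
        # Percentage form; int() truncates toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # A negative value means "all clauses except this many".
    result = clause_count + calc if calc < 0 else calc
    return max(0, min(clause_count, result))


MM = "2<-1 5<-2 6<90%"

# The query "apache solr" has two optional clauses, so both must match.
required = min_should_match(2, MM)
for doc_id, matched_clauses in [(1, 1), (2, 1), (3, 2)]:
    verdict = "matches" if matched_clauses >= required else "does not match"
    print(f"doc #{doc_id}: {matched_clauses} of {required} required clauses, {verdict}")
```

With this spec, queries of one or two terms require every term, which is why docs #1 and #2 from the sample data would be excluded rather than ranked lower.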
I used the example 3.5 schema, and created 3 docs matching what you
described. (with name copyfield'ed into text)...
<add>
<doc><field name="id">1</field><field name="name">apache apache</field></doc>
<doc><field name="id">2</field><field name="name">solr solr</field></doc>
<doc><field name="id">3</field><field name="name">apache solr</field></doc>
</add>
...and then used this similar query (note mm=1) to get the results you
would expect...
http://localhost:8983/solr/select/?fl=name,score&debugQuery=true&defType=edismax&qf=name+text&pf=name^10+text^5&q=apache%20solr&mm=1
<result name="response" numFound="3" start="0" maxScore="1.309231">
  <doc>
    <float name="score">1.309231</float>
    <str name="name">apache solr</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">apache apache</str>
  </doc>
  <doc>
    <float name="score">0.022042051</float>
    <str name="name">solr solr</str>
  </doc>
</result>
-Hoss
Re: term frequency outweighs exact phrase match
Posted by al...@aim.com.
In that case documents 1 and 2 will not be in the results. We also need them to be shown in the results, but ranked after the docs with an exact match.
I think omitting term frequency when calculating the ranking for phrase queries would solve this issue, but I do not see such a parameter in the configs.
I see omitTermFreqAndPositions="true", but I am not sure whether it is the setting I need, because its description is too vague.
Thanks.
Alex.
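As background on how much term frequency can contribute, the tf and idf values in the explain output elsewhere in this thread are consistent with the formulas of Lucene's default similarity: tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The sketch below assumes the default similarity is in use (the thread does not say otherwise); the function names are illustrative only. Note also that omitTermFreqAndPositions drops positions as well as frequencies, and phrase queries depend on positions, so that setting is unlikely to help with phrase ranking.

```python
import math


def tf(freq):
    """Assumed default-similarity term frequency factor: sqrt(freq)."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Assumed default-similarity inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


MAX_DOCS = 3613605  # maxDocs as reported in the explain output

print(f"tf(5)                = {tf(5):.6f}")
print(f"idf(content:apache)  = {idf(126092, MAX_DOCS):.6f}")
print(f"idf(content:solr)    = {idf(25986, MAX_DOCS):.6f}")
print(f"idf(title:apache)    = {idf(2375, MAX_DOCS):.6f}")
# The phrase idf in the explain output equals the sum of the two content term idfs.
print(f"idf(phrase, content) = {idf(126092, MAX_DOCS) + idf(25986, MAX_DOCS):.6f}")
```

Because tf grows only with the square root, five occurrences of a term raise that clause by a factor of about 2.24, which is small next to the differences in idf, boost, and fieldNorm between fields.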
-----Original Message-----
From: Erick Erickson <er...@gmail.com>
To: solr-user <so...@lucene.apache.org>
Sent: Wed, Apr 11, 2012 8:23 am
Subject: Re: term frequency outweighs exact phrase match
Consider boosting on phrase with a SHOULD clause, something
like field:"apache solr"^2..
Best
Erick
Re: term frequency outweighs exact phrase match
Posted by Erick Erickson <er...@gmail.com>.
Consider boosting on the phrase with a SHOULD clause, something
like field:"apache solr"^2.
Best
Erick
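One possible way to try an optional phrase boost without editing the handler is to pass it as a boost query. The Python sketch below only builds the request URL. Using the bq parameter for this is an assumption about how to wire the clause in, not something stated in the thread; the field name content comes from the pf setting in the original post, and the host, port, and qt=search are taken from the query URL posted in this thread.

```python
from urllib.parse import urlencode

# Base URL as used in the thread; adjust for a real deployment.
BASE_URL = "http://localhost:8983/solr/select/"

params = {
    "q": "apache solr",
    "qt": "search",
    "debugQuery": "true",
    # Assumption: supply the optional phrase clause as a boost query (bq),
    # so it influences ranking without changing which docs match.
    "bq": 'content:"apache solr"^2',
}

# urlencode percent-encodes the quotes, colon, and caret, and turns spaces into '+'.
url = BASE_URL + "?" + urlencode(params)
print(url)
```

A boost query only changes scores. Which documents match is still decided by the main query and mm, so the single-term docs would also need a looser mm to appear at all.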