Posted to solr-user@lucene.apache.org by Vijay Ramachandran <vi...@gmail.com> on 2011/10/16 19:28:53 UTC
help with phrase query
Hello. I have an application where I try to match longer queries (sentences)
to short documents (search phrases). Typically, the documents are 3-5 terms
in length. I am facing a problem where phrase match in the indicated phrase
fields via "pf" doesn't seem to match in most cases, and I am stumped.
Please help!
For instance, when my query is "should I buy a house now while the rates are
low. We filed BR 2 yrs ago. Rent now, w/ some sch loan debt"
I expect the document "buy a house" to match much higher than "house
loan rates".
However, the latter is the document which always matches higher.
I tried to do this the following way (solr 3.1):
1. Score phrase matches high
2. Score single word matches lower
3. Use dismax with a "mm" of 1, and very high boost for exact phrase match.
I used the "text" definition in the schema for the single words, and the
following for the phrase:
<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
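To make the effect of the shingle filter concrete, here is a rough Python sketch of the word n-grams it would emit for a short document. It is only an approximation: it assumes the default minimum shingle size of 2 and a single-space token separator, and it ignores the WordDelimiter, synonym and protected-word steps entirely. The `shingles` function is a hypothetical helper, not part of Solr.

```python
def shingles(text, min_size=2, max_size=3):
    """Return word n-grams of min_size..max_size tokens, with no unigrams.

    Approximates a whitespace tokenizer + lowercase filter + shingle filter
    with maxShingleSize=3 and outputUnigrams=false (assumed min size of 2).
    """
    tokens = text.lower().split()
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


doc_terms = shingles("buy a house")
query_terms = shingles("should I buy a house now while the rates are low")

print(doc_terms)
# Shingles that the long query and the short document have in common
print(sorted(set(doc_terms) & set(query_terms)))
```

Under these assumptions the document and the long query share several shingle terms, so comparing this list with what the Solr analysis page reports for kw_phrases may help narrow down where the match is lost.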
and my schema fields look like this:
<field name="kw_stopped" type="text_en" indexed="true" omitNorms="True"/>
<!-- keywords almost as is - to provide truer match for full phrases -->
<field name="kw_phrases" type="shingle" indexed="true" omitNorms="True"/>
This is my search handler config:
<requestHandler name="edismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.1</float>
    <str name="fl">
      kpid,advid,campaign,keywords
    </str>
    <str name="mm">1</str>
    <str name="qf">
      kw_stopped^1.0
    </str>
    <str name="pf">
      kw_phrases^50.0
    </str>
    <int name="ps">3</int>
    <int name="qs">3</int>
    <str name="q.alt">*:*</str>
    <!-- example highlighter config, enable per-query with hl=true -->
    <str name="hl.fl">keywords</str>
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">0</str>
    <!-- instructs Solr to return the field itself if no query terms are found -->
    <str name="f.name.hl.alternateField">title</str>
    <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
  </lst>
</requestHandler>
These are the match score debugQuery explanations:
8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)
for "house loan rates" vs
8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)
for "buy a house".
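The arithmetic in the explanations above can be re-checked with a few lines of Python. This is only a sketch that plugs in the numbers as posted (both explanations list the same figures, so one check covers both); the variable and function names are made up for illustration. Each term weight is queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), and the sum over the three kw_stopped terms is scaled by coord(3/11).

```python
# Values copied from the debugQuery output in this message.
query_norm = 5.1152787e-5
coord = 3 / 11  # 3 of the 11 optional query clauses matched
idf = {"hous": 5.514656, "rate": 4.002068, "loan": 3.7891462}


def term_weight(idf_value, tf=1.0, field_norm=1.0):
    """weight = queryWeight * fieldWeight for a single term."""
    query_weight = idf_value * query_norm
    field_weight = tf * idf_value * field_norm
    return query_weight * field_weight


total = sum(term_weight(value) for value in idf.values())
score = total * coord

print(total)  # compare with 0.0031093531 in the explanation
print(score)  # compare with 8.480054E-4 in the explanation
```

Only kw_stopped clauses enter this sum, which is consistent with the observation that kw_phrases contributes nothing to the score.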
Unless I try an exact phrase "buy a house" as the query, the kw_phrases
never shows up in the explanation.
What am I doing wrong? Please help!
thanks,
Vijay
Re: help with phrase query
Posted by elisabeth benoit <el...@gmail.com>.
I think you can use pf2 and pf3 in your requestHandler.
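As a rough illustration, pf2 and pf3 can be tried as plain request parameters before touching the handler defaults. The Python sketch below only builds the query string; the host, port and path, the choice of kw_phrases as the target field, and the boost values are all placeholder assumptions, not settings taken from this thread.

```python
from urllib.parse import urlencode, parse_qs

# Request parameters for an edismax query. pf2/pf3 ask the parser to add
# boosted phrase queries over word pairs and word triples of the user query.
params = {
    "defType": "edismax",
    "q": "should I buy a house now while the rates are low",
    "qf": "kw_stopped^1.0",
    "pf": "kw_phrases^50.0",
    "pf2": "kw_phrases^20.0",  # placeholder field and boost
    "pf3": "kw_phrases^30.0",  # placeholder field and boost
    "mm": "1",
    "debugQuery": "true",      # to see which clauses actually match
}

query_string = urlencode(params)

# Hypothetical Solr endpoint; adjust to the actual host and handler path.
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```

With debugQuery enabled, the parsed query in the response should show whether the pair and triple phrase clauses were generated and against which field.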
Best regards,
Elisabeth