Posted to solr-user@lucene.apache.org by Vijay Ramachandran <vi...@gmail.com> on 2011/10/16 19:28:53 UTC

help with phrase query

Hello. I have an application where I try to match longer queries (sentences)
to short documents (search phrases). Typically, the documents are 3-5 terms
in length. I am facing a problem where phrase match in the indicated phrase
fields via "pf" doesn't seem to match in most cases, and I am stumped.
Please help!

For instance, when my query is "should I buy a house now while the rates are
low. We filed BR 2 yrs ago. Rent now, w/ some sch loan debt"

I expect the document "buy a house" to score much higher than "house loan
rates". However, the latter is the document that always scores higher.


I tried to do this the following way (solr 3.1):
1. Score phrase matches high
2. Score single word matches lower
3. Use dismax with a "mm" of 1, and very high boost for exact phrase match.

I used the "text" definition in the schema for the single words, and the
following for the phrase:

    <fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
                outputUnigrams="false"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
                outputUnigrams="false"/>
      </analyzer>
    </fieldType>

and my schema fields look like this:

   <field name="kw_stopped" type="text_en" indexed="true" omitNorms="True"/>

   <!-- keywords almost as is - to provide truer match for full phrases -->
   <field name="kw_phrases" type="shingle" indexed="true" omitNorms="True"/>
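
To make concrete what the "shingle" type emits, here is a minimal Python
sketch of the shingling step alone. It assumes a minimum shingle size of 2
(which I believe is the ShingleFilterFactory default) and a single-space
separator, and it skips the word-delimiter, lowercase, synonym and
keyword-marker filters entirely. The function name "shingles" is made up for
illustration and is not part of Solr.

```python
def shingles(tokens, min_size=2, max_size=3, sep=" "):
    """Return word n-grams of min_size..max_size tokens, grouped by start position.

    Unigrams are never returned, mirroring outputUnigrams="false".
    """
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


# Shingles for a short document and for the start of the long query.
doc_shingles = shingles("buy a house".split())
query_shingles = shingles("should i buy a house now".split())

# Shingles the query and the document have in common.
shared = [s for s in query_shingles if s in set(doc_shingles)]

print(doc_shingles)
print(shared)
```

Under these assumptions every shingle of the document "buy a house" also
occurs among the query's shingles, so the two sides do produce matching terms
when they are compared term by term.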

This is my search handler config:

  <requestHandler name="edismax" class="solr.SearchHandler" default="true">
    <lst name="defaults">
     <str name="defType">edismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.1</float>
     <str name="fl">
       kpid,advid,campaign,keywords
     </str>
     <str name="mm">1</str>
     <str name="qf">
       kw_stopped^1.0
     </str>
     <str name="pf">
       kw_phrases^50.0
     </str>
     <int name="ps">3</int>
     <int name="qs">3</int>
     <str name="q.alt">*:*</str>
     <!-- example highlighter config, enable per-query with hl=true -->
     <str name="hl.fl">keywords</str>
     <!-- for this field, we want no fragmenting, just highlighting -->
     <str name="f.name.hl.fragsize">0</str>
     <!-- instructs Solr to return the field itself if no query terms are
          found -->
     <str name="f.name.hl.alternateField">title</str>
     <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
    </lst>
  </requestHandler>

These are the match score debugQuery explanations:

8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)

for "house loan rates" vs

8.480054E-4 = (MATCH) sum of:
  8.480054E-4 = (MATCH) product of:
    0.0031093531 = (MATCH) sum of:
      0.0015556295 = (MATCH) weight(kw_stopped:hous in 1812), product of:
        2.8209004E-4 = queryWeight(kw_stopped:hous), product of:
          5.514656 = idf(docFreq=25, maxDocs=2375)
          5.1152787E-5 = queryNorm
        5.514656 = (MATCH) fieldWeight(kw_stopped:hous in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:hous)=1)
          5.514656 = idf(docFreq=25, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      8.192911E-4 = (MATCH) weight(kw_stopped:rate in 1812), product of:
        2.0471694E-4 = queryWeight(kw_stopped:rate), product of:
          4.002068 = idf(docFreq=117, maxDocs=2375)
          5.1152787E-5 = queryNorm
        4.002068 = (MATCH) fieldWeight(kw_stopped:rate in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:rate)=1)
          4.002068 = idf(docFreq=117, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
      7.344327E-4 = (MATCH) weight(kw_stopped:loan in 1812), product of:
        1.9382538E-4 = queryWeight(kw_stopped:loan), product of:
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          5.1152787E-5 = queryNorm
        3.7891462 = (MATCH) fieldWeight(kw_stopped:loan in 1812), product of:
          1.0 = tf(termFreq(kw_stopped:loan)=1)
          3.7891462 = idf(docFreq=145, maxDocs=2375)
          1.0 = fieldNorm(field=kw_stopped, doc=1812)
    0.27272728 = coord(3/11)

for "buy a house".
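
The numbers in the explanation above can be reproduced by hand. The sketch
below assumes Lucene's classic TF-IDF similarity, where
idf = 1 + ln(maxDocs / (docFreq + 1)), and takes queryNorm, tf, fieldNorm and
coord directly from the explain output. The helper names are made up for
illustration.

```python
import math

MAX_DOCS = 2375
QUERY_NORM = 5.1152787e-5  # copied from the explain output
DOC_FREQS = {"hous": 25, "rate": 117, "loan": 145}


def idf(doc_freq, max_docs=MAX_DOCS):
    """Classic Lucene idf (assumed): 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(doc_freq):
    """weight = queryWeight * fieldWeight, with tf = 1.0 and fieldNorm = 1.0."""
    query_weight = idf(doc_freq) * QUERY_NORM
    field_weight = 1.0 * idf(doc_freq) * 1.0
    return query_weight * field_weight


# Sum the three matching kw_stopped clauses, then scale by coord(3/11):
# 3 of the 11 optional query clauses matched this document.
clause_sum = sum(term_score(df) for df in DOC_FREQS.values())
score = clause_sum * (3 / 11)

print(clause_sum, score)
```

All of the score comes from the three kw_stopped term clauses scaled by
coord(3/11); no kw_phrases clause contributes anything.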

Unless I try an exact phrase "buy a house" as the query, the kw_phrases
never shows up in the explanation.

What am I doing wrong? Please help!

thanks,
Vijay

Re: help with phrase query

Posted by elisabeth benoit <el...@gmail.com>.
I think you can use pf2 and pf3 in your requestHandler.
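
A rough way to see why this should help: as far as I understand edismax, pf
tries to match the whole query string as one phrase, while pf2 and pf3 build
phrase queries from every pair and every triple of adjacent query words. The
Python sketch below is only a simplified model of that idea. It ignores
analysis (stemming, stopwords, shingling) and phrase slop, and the names
"word_ngrams" and "boosted" are made up for illustration.

```python
def word_ngrams(words, n):
    """All runs of n adjacent words, joined with single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


query = "should i buy a house now while the rates are low".split()

whole_phrase = " ".join(query)    # roughly what pf looks for
bigrams = word_ngrams(query, 2)   # roughly what pf2 looks for
trigrams = word_ngrams(query, 3)  # roughly what pf3 looks for


def contains_phrase(doc, phrase):
    """True if phrase occurs in doc on word boundaries."""
    return f" {phrase} " in f" {doc} "


def boosted(doc):
    """Which phrase parameters would find at least one phrase in doc."""
    return {
        "pf": contains_phrase(doc, whole_phrase),
        "pf2": any(contains_phrase(doc, p) for p in bigrams),
        "pf3": any(contains_phrase(doc, p) for p in trigrams),
    }


for doc in ["buy a house", "house loan rates"]:
    print(doc, boosted(doc))
```

In this model the long query never matches a three-word document as a whole
phrase, but its word pairs and triples do match "buy a house" and do not match
"house loan rates".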

Best regards,
Elisabeth
