Posted to solr-user@lucene.apache.org by "Catala, Francois" <Fr...@nuance.com> on 2013/03/18 23:21:00 UTC

Shingles Filter Query time behaviour

Hello,

I am trying to have the input "darkknight" match documents containing either "dark knight" or "darkknight".
The reverse should also work ("dark knight" matching both "dark knight" and "darkknight"), but it doesn't. Does anyone know why?


When I run the following query, I get the expected response, with both documents matched:

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="fl">name</str>
    <str name="indent">true</str>
    <str name="q">name:darkknight</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
    <str name="name">Batman, the darkknight Rises</str></doc>
  <doc>
    <str name="name">Batman, the dark knight Rises</str></doc>
</result>
</response>


HOWEVER, when I run the same query looking for "dark knight" as two words, only 1 document is matched, as the response shows:

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="fl">name</str>
    <str name="indent">true</str>
    <str name="q">name:dark knight</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
    <str name="name">Batman, the dark knight Rises</str></doc>
</result>
</response>

I have these documents as input:

<doc>
  <field name="id">bat1</field>
  <field name="name">Batman, the dark knight Rises</field>
</doc>
<doc>
  <field name="id">bat2</field>
  <field name="name">Batman, the darkknight Rises</field>
</doc>

And I defined this analyser:

      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
                tokenSeparator=""
                outputUnigrams="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
                tokenSeparator=""
                outputUnigrams="true"
                outputUnigramIfNoNgrams="true"/>
      </analyzer>
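
To see which terms this analyser chain produces, here is a rough Python stand-in for its three stages (whitespace tokenizing, lowercasing, shingling with an empty separator). It is not Solr code: the function name `analyze` and its parameters are made up for illustration, and it assumes the default shingle size of 2. The last call assumes that the query parser hands each whitespace-separated word to the analyser separately, which is my understanding of the standard query parser in Solr releases of this era.

```python
def analyze(text, max_shingle_size=2, token_separator="", output_unigrams=True):
    """Hypothetical stand-in for WhitespaceTokenizer + LowerCaseFilter + ShingleFilter."""
    # Whitespace tokenizing and lowercasing; punctuation stays attached ("batman,").
    words = text.lower().split()
    tokens = []
    for i in range(len(words)):
        if output_unigrams:
            tokens.append(words[i])
        # Shingles of 2..max_shingle_size words, joined with the separator.
        for size in range(2, max_shingle_size + 1):
            if i + size <= len(words):
                tokens.append(token_separator.join(words[i:i + size]))
    return tokens


doc1 = analyze("Batman, the dark knight Rises")
doc2 = analyze("Batman, the darkknight Rises")
print(doc1)  # includes "dark", "knight" and the shingle "darkknight"
print(doc2)  # includes "darkknight" but neither "dark" nor "knight"

# Analysing both words together yields the joined shingle.
print(analyze("dark knight"))

# Assumption: an unquoted query is split on whitespace first, so each word
# is analysed alone and no shingle can be formed.
print([analyze(word) for word in "dark knight".split()])
```

If that assumption holds, an unquoted two-word query never produces the "darkknight" term at query time, which would be consistent with only the "dark knight" document being returned.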

Re: Shingles Filter Query time behaviour

Posted by Jack Krupansky <ja...@basetechnology.com>.
Or, q=name:(dark knight) .

-- Jack Krupansky

-----Original Message----- 
From: Otis Gospodnetic
Sent: Monday, March 25, 2013 11:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Shingles Filter Query time behaviour

Hi,

What does your query look like?  Does it look like q=name:dark knight?
If so, note that only "dark" is going against the "name" field.  Try
q=name:dark name:knight or q=name:"dark knight".

Otis
--
Solr & ElasticSearch Support
http://sematext.com/







Re: Shingles Filter Query time behaviour

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

What does your query look like?  Does it look like q=name:dark knight?
 If so, note that only "dark" is going against the "name" field.  Try
q=name:dark name:knight or q=name:"dark knight".

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
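
For anyone wanting to compare the query forms suggested in this thread side by side, a small Python sketch that only builds the request URLs might look like the following. The host, port and core name in `SOLR_SELECT` are assumptions (adjust to your own setup), `select_url` is a made-up helper, and the other parameters are copied from the responses in the original question. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def select_url(q):
    """Build a select URL with the same fl/wt/indent parameters as the question."""
    params = {"q": q, "fl": "name", "wt": "xml", "indent": "true"}
    return SOLR_SELECT + "?" + urlencode(params)


queries = {
    # Only "dark" is searched in "name"; "knight" goes to the default field.
    "original": "name:dark knight",
    # Both terms explicitly fielded.
    "fielded": "name:dark name:knight",
    # Phrase query on the field.
    "phrase": 'name:"dark knight"',
    # Grouped terms on the field.
    "grouped": "name:(dark knight)",
}

for label, q in queries.items():
    print(label, select_url(q))
```

Comparing `numFound` across the four responses should show which form, if any, returns both documents with the analyser configuration from the question.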




