Posted to solr-user@lucene.apache.org by Brian Whitman <br...@echonest.com> on 2008/11/25 17:40:21 UTC

matching exact terms

This is probably severe user error, but I am curious about how to index docs
to make this query work:
happy birthday

to return the doc with n_name:"Happy Birthday" before the doc with
n_name:"Happy Birthday, Happy Birthday". As it is now, the latter appears
first for a query of n_name:"happy birthday", and the former second.

It would be great to do this at query time instead of having to re-index,
but I will if I have to!

The n_* type is defined as:

    <fieldtype name="name" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

Re: matching exact terms

Posted by Ryan McKinley <ry...@gmail.com>.

On Nov 25, 2008, at 11:40 AM, Brian Whitman wrote:

> This is probably severe user error, but I am curious about how to index docs
> to make this query work:
> happy birthday
>
> to return the doc with n_name:"Happy Birthday" before the doc with
> n_name:"Happy Birthday, Happy Birthday". As it is now, the latter appears
> first for a query of n_name:"happy birthday", the former second.
>
> It would be great to do this at query time instead of having to re-index,
> but I will if I have to!
>
> The n_* type is defined as:
>
>    <fieldtype name="name" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StandardFilterFactory"/>
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StandardFilterFactory"/>
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldtype>

Hi Brian!

What does the explain output show when you turn on debugQuery=true?

With the indexing scheme you have, "happy birthday, happy birthday"
will match four terms, while "happy birthday" matches only two.

Two options come to mind (sorry, both require reindexing):

1. Add the remove duplicates filter. This would make both documents
match only two terms, and the fieldNorm should boost the shorter field
over the longer one. However, removing the duplicates may make some
other queries less relevant.

2. Add a copyField and index the name as a string or something without
tokenization (use the KeywordTokenizerFactory), then query on both
fields (dismax) and boost an exact match over a text match:
   name_with_tokens^1 name_no_tokens^3 (or something like that)
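
For what it's worth, option 2 might look roughly like the sketch below. The
names "name_exact", "n_name_exact" and "/names" are made up for illustration,
and the boosts are just the ones suggested above, so treat this as a starting
point rather than a tested config. It assumes the existing field is n_name
and a Solr version that supports defType (1.3 or later).

```xml
<!-- schema.xml, inside <types>: untokenized variant of the "name" type.
     "name_exact" is a hypothetical name. -->
<fieldtype name="name_exact" class="solr.TextField">
  <analyzer>
    <!-- KeywordTokenizerFactory emits the whole field value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<!-- schema.xml, inside <fields>: "n_name_exact" is a hypothetical name -->
<field name="n_name_exact" type="name_exact" indexed="true" stored="false"/>

<!-- schema.xml, next to the other copyField declarations -->
<copyField source="n_name" dest="n_name_exact"/>

<!-- solrconfig.xml: a dismax handler that queries both fields.
     The handler name "/names" is an assumption. -->
<requestHandler name="/names" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">n_name^1 n_name_exact^3</str>
    <!-- Assumption: dismax splits the user query on whitespace, so the
         untokenized field may only match via the phrase field (pf). -->
    <str name="pf">n_name_exact^3</str>
  </lst>
</requestHandler>
```

With something like this in place, a query of happy birthday against the
dismax handler should score the doc whose whole name is "Happy Birthday"
higher, since only that doc has the single token "happy birthday" in the
untokenized field. Checking the debugQuery=true output would confirm whether
the exact-match clause is actually contributing.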

ryan