Posted to solr-user@lucene.apache.org by Papalagi Pakeha <pa...@gmail.com> on 2007/11/06 06:05:12 UTC

Score of exact matches

Hi all,

I use Solr 1.2 on a job advertising site. I started from the default
setup, which runs all documents and queries through
EnglishPorterFilterFactory. As a result, an ad with "accounts" in its
title is matched when someone searches for "accountant", because both
words are stemmed to "account".

Is it somehow possible to give a higher score to exact matches and
sort them before matches from stemmed terms?

A related problem is accents: I can strip accents from both documents
and queries and then search on the unaccented terms. But I'd like to
give a higher score to documents where the search term matches exactly
(i.e. including accents, and possibly capitalization) and sort them
before the fuzzier matches.

To me it looks like I have to run multiple sub-queries for each query,
one for exact match, one for accents removed and one for stemmed words
and then combine the results and compute the final score for each
match. Is that possible?

Thanks!

PaPa

Re: Score of exact matches

Posted by Mike Klaas <mi...@gmail.com>.

On 5-Nov-07, at 9:05 PM, Papalagi Pakeha wrote:

> Hi all,
>
> I use Solr 1.2 on a job advertising site. I started from the default
> setup that runs all documents and queries through
> EnglishPorterFilterFactory. As a result for example an ad with
> "accounts" in its title is matched when someone runs a query for
> "accountant" because both are stemmed to the "account" word and then
> they match.
>
> Is it somehow possible to give a higher score to exact matches and
> sort them before matches from stemmed terms?
>
> Close to this is a problem with accents - I can remove accents from
> both documents and from queries and then run the query on non-accented
> terms. But I'd like to give higher score to documents where the search
> term matches exactly (i.e. including accents and possibly letter
> capitalization, etc) and sort them before more fuzzy searches.
>
> To me it looks like I have to run multiple sub-queries for each query,
> one for exact match, one for accents removed and one for stemmed words
> and then combine the results and compute the final score for each
> match. Is that possible?

One way to do this is to index both alternatives at every term
position. When stemming, you'd store (account accountant),
(account accounts), etc.; when stripping accents, (epee épée),
(fantome fantôme), etc.

Now when querying, transform your query into <canonicalized version>
<original version>^10:

épée -> epee épée^10
accountant -> account accountant^10

A bit of work to do in general, though.
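
The query-side rewrite described above can be sketched in a few lines of Python. This is only an illustration, not part of Solr: the function names, the default boost of 10 and the one-entry TOY_STEMS table are assumptions made for the example, and a real setup would need a canonicalizer that matches whatever the index-time filter produces.

```python
import unicodedata


def strip_accents(term: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. 'épée' -> 'epee'."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def boost_exact(query: str, canonicalize=strip_accents, boost: int = 10) -> str:
    """Rewrite each term as '<canonical> <original>^boost'."""
    clauses = []
    for term in query.split():
        canonical = canonicalize(term)
        if canonical == term:
            # Both forms coincide, so a boost could not tell them apart.
            clauses.append(term)
        else:
            clauses.append(f"{canonical} {term}^{boost}")
    return " ".join(clauses)


# Hypothetical stand-in for a real stemmer, for illustration only.
TOY_STEMS = {"accountant": "account"}

print(boost_exact("épée"))  # epee épée^10
print(boost_exact("accountant", canonicalize=lambda t: TOY_STEMS.get(t, t)))
# account accountant^10
```

The index side (emitting both forms at the same position) is the part that needs a custom token filter, which is where most of the work lies.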

-Mike

Re: Score of exact matches

Posted by Papalagi Pakeha <pa...@gmail.com>.

On 11/6/07, Walter Underwood <wu...@netflix.com> wrote:
> This is fairly straightforward and works well with the DisMax
> handler. Index the text into three different fields with three
> different sets of analyzers. Use something like this in the
> request handler:
> [...]
>     <str name="qf">
>           exact^16 noaccent^4 stemmed
>     </str>

Thanks, that's exactly what I needed. Being new to Solr, I didn't know
exactly how the filters and analyzers work together. With your hint I
learned it all and now it works beautifully :-)

PaPa

RE: Score of exact matches

Posted by "Norskog, Lance" <la...@divvio.com>.

What is the performance profile of this against merely searching against
one field? My situation is millions of small records with an average of
200 bytes/text field.

Lance 

-----Original Message-----
From: Walter Underwood [mailto:wunderwood@netflix.com] 
Sent: Monday, November 05, 2007 9:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Score of exact matches

This is fairly straightforward and works well with the DisMax handler.
Index the text into three different fields with three different sets of
analyzers. Use something like this in the request handler:

 <requestHandler name="multimatch" class="solr.DisMaxRequestHandler">
   <lst name="defaults">
     <float name="tie">0.01</float>
     <str name="qf">
       exact^16 noaccent^4 stemmed
     </str>
     <str name="pf">
       exact^16 noaccent^4 stemmed
     </str>
   </lst>
 </requestHandler>

You will probably need to adjust the weights for your content, though I
expect these are a good starting place.

Per-field analyzers are very easy to use in Solr and are extremely
powerful. I wish we'd thought of that in Ultraseek.

wunder
==
Search Guy, Netflix
Formerly: Architect, Ultraseek


Re: Score of exact matches

Posted by Walter Underwood <wu...@netflix.com>.

This is fairly straightforward and works well with the DisMax
handler. Index the text into three different fields with three
different sets of analyzers. Use something like this in the
request handler:

 <requestHandler name="multimatch" class="solr.DisMaxRequestHandler">
   <lst name="defaults">
     <float name="tie">0.01</float>
     <str name="qf">
       exact^16 noaccent^4 stemmed
     </str>
     <str name="pf">
       exact^16 noaccent^4 stemmed
     </str>
   </lst>
 </requestHandler>

You will probably need to adjust the weights for your content,
though I expect these are a good starting place.
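
To get a feel for how the weights and the tie value interact, here is a rough Python sketch of the way a disjunction-max query combines per-field scores: the best boosted field score wins, and the other matching fields contribute only tie times their score. It is a simplification. The raw per-field scores of 1.0 are made up, real Lucene scores also depend on term frequency, document frequency and norms, and the extra contribution of the pf phrase boost is not modeled.

```python
from typing import Mapping

# Boosts and tie taken from the handler configuration above.
BOOSTS = {"exact": 16.0, "noaccent": 4.0, "stemmed": 1.0}
TIE = 0.01


def dismax_score(raw_scores: Mapping[str, float],
                 boosts: Mapping[str, float] = BOOSTS,
                 tie: float = TIE) -> float:
    """Max of the boosted field scores plus tie * (sum of the others)."""
    boosted = [score * boosts.get(field, 1.0)
               for field, score in raw_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie * (sum(boosted) - best)


# Hypothetical raw scores: 1.0 for every field the document matches in.
exact_hit = dismax_score({"exact": 1.0, "noaccent": 1.0, "stemmed": 1.0})
accent_hit = dismax_score({"noaccent": 1.0, "stemmed": 1.0})
stem_hit = dismax_score({"stemmed": 1.0})

print(round(exact_hit, 2), round(accent_hit, 2), round(stem_hit, 2))
```

With a small tie, the ordering is driven almost entirely by the highest-boosted field that matches, which is what puts exact matches ahead of accent-stripped and stemmed ones.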

Per-field analyzers are very easy to use in Solr and are
extremely powerful. I wish we'd thought of that in Ultraseek.
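
As an illustration of what three fields with three analyzers buys you, the Python sketch below imitates the three analysis chains outside Solr and reports which fields a query would match. Everything in it is an assumption made for the example: the exact chain is taken to be case-sensitive whitespace splitting, and toy_stem is a crude stand-in, not the Porter stemmer that EnglishPorterFilterFactory applies.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Crude suffix stripping; a toy stand-in for a real stemmer."""
    for suffix in ("ants", "ant", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# One analysis chain per field, from strictest to loosest.
ANALYZERS = {
    "exact": lambda text: text.split(),
    "noaccent": lambda text: [strip_accents(t.lower()) for t in text.split()],
    "stemmed": lambda text: [toy_stem(strip_accents(t.lower()))
                             for t in text.split()],
}


def matching_fields(document: str, query: str) -> list:
    """Return the fields in which every analyzed query term is indexed."""
    hits = []
    for field, analyze in ANALYZERS.items():
        indexed = set(analyze(document))
        if all(term in indexed for term in analyze(query)):
            hits.append(field)
    return hits


print(matching_fields("Senior accountant wanted", "accountant"))
print(matching_fields("Accounts clerk", "accountant"))
print(matching_fields("Épée instructor", "epee"))
```

A document that matches in the stricter fields also matches in the looser ones, so the field boosts decide how far each kind of match is lifted in the ranking.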

wunder
==
Search Guy, Netflix
Formerly: Architect, Ultraseek
