Posted to solr-user@lucene.apache.org by Stephen Duncan Jr <st...@gmail.com> on 2009/11/03 17:22:04 UTC

Customizing Field Score (Multivalued Field)

We've had a customized score calculator for one of our fields that we
developed when we were using Lucene instead of Solr (Lucene 2.4).  During
our switch to Solr, we simply continued to use that code.  However, as the
version of Lucene used in Solr changed to 2.9, somewhere along the way
(unfortunately during our last release of code), that customization broke.
I'd previously tried to keep it up to date by dealing with deprecation
warnings, but managed to break things.  Now I'm pretty lost with regards to
that code.

Our customization is conceptually pretty simple, so rather than try to fix
up our code, I'd like some advice on the best way to implement this with
Solr 1.4 starting fresh.

We have a multi-valued field, where each value is basically the id of a
category.  Along with the id, there's a score for how well the document fits
into that category (between 0.0 and 1.0).  I'm looking for that
category-score to affect the score of documents when searching on that
field.  Any suggestions on the best way to attack this in Solr 1.4?

Here's how we did it in Lucene: we had an extension of Query, with a custom
scorer.  In the index we stored the category ids as a single-valued,
space-separated string.  We also stored a space-separated string of scores
in another field.  We made both of these fields stored.  We simply delegated
the search to the normal searcher; then, when we calculated the score, we
retrieved the values of both fields for the document.  We turned the
space-separated strings into arrays, searched the id array for the index of
the desired id, looked up the score at that same index in the score array,
and returned it.
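
A minimal sketch of that lookup, in plain Java with no Lucene dependency.
The class name, method name, and sample values are hypothetical, and
returning 0.0 for a missing id is an assumption, not something the
original code is known to have done:

```java
import java.util.Arrays;

/** Sketch of the stored-field lookup described above; all names are hypothetical. */
final class StoredCategoryScores {

    /**
     * Returns the score paired with categoryId, or 0.0f when the id is absent
     * (the fallback value is an assumption made for this sketch).
     * ids and scores are the two space-separated stored strings.
     */
    static float scoreFor(String ids, String scores, String categoryId) {
        String[] idArray = ids.trim().split("\\s+");
        String[] scoreArray = scores.trim().split("\\s+");
        // Find the position of the requested id, then read the score at that position.
        int index = Arrays.asList(idArray).indexOf(categoryId);
        if (index < 0 || index >= scoreArray.length) {
            return 0.0f;
        }
        return Float.parseFloat(scoreArray[index]);
    }

    public static void main(String[] args) {
        System.out.println(scoreFor("12 7 31", "0.8 0.35 0.6", "7"));
    }
}
```

Doing this for every matching document means loading and splitting two
stored fields per hit, which is where the query-time cost comes from.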

-- 
Stephen Duncan Jr
www.stephenduncanjr.com

Re: Customizing Field Score (Multivalued Field)

Posted by Stephen Duncan Jr <st...@gmail.com>.
On Thu, Nov 12, 2009 at 3:00 PM, Stephen Duncan Jr <stephen.duncan@gmail.com> wrote:

> On Thu, Nov 12, 2009 at 2:54 PM, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
>>
>> oh man, so you were parsing the Stored field values of every matching doc
>> at query time? ouch.
>>
>> Assuming i'm understanding your goal, the conventional way to solve this
>> type of problem is "payloads" ... you'll find lots of discussion on it in
>> the various Lucene mailing lists, and if you look online Michael Busch has
>> various slides that talk about using them.  they let you say things
>> like "in this document, at this position of field 'x' the word 'microsoft'
>> is worth 37.4, but at this other position (or in this other document)
>> 'microsoft' is only worth 17.2"
>>
>> The simplest way to use them in Solr (as i understand it) is to use
>> something like the DelimitedPayloadTokenFilterFactory when indexing, and
>> then write yourself a simple little custom QParser that generates a
>> BoostingTermQuery on your field.
>>
>> should be a lot simpler to implement than the Query you are describing,
>> and much faster.
>>
>>
>> -Hoss
>>
>>
> Thanks. I finally got around to looking at this again today and was looking
> at a similar path, so I appreciate the confirmation.
>
>
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com
>

For posterity, here's the rest of what I discovered trying to implement
this:

You'll need to write a PayloadSimilarity as described here:
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
(here's my updated version, due to deprecation of the method mentioned in
that article):

    @Override
    public float scorePayload(
        int docId,
        String fieldName,
        int start,
        int end,
        byte[] payload,
        int offset,
        int length)
    {
        // can ignore length here, because we know it is encoded as 4 bytes
        return PayloadHelper.decodeFloat(payload, offset);
    }
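
To make the "encoded as 4 bytes" comment concrete, here is a stand-alone
sketch of a float payload round trip.  It does not call Lucene; the class
and method names are hypothetical, and the big-endian byte order is my
assumption about what PayloadHelper does, so check the Lucene source
before relying on it:

```java
/** Stand-alone illustration of a 4-byte float payload; not Lucene's own code. */
final class FloatPayloadSketch {

    /** Encodes a float as 4 big-endian bytes (assumed layout, see lead-in). */
    static byte[] encode(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
        };
    }

    /** Decodes the 4 bytes starting at offset back into a float. */
    static float decode(byte[] payload, int offset) {
        int bits = ((payload[offset] & 0xFF) << 24)
                 | ((payload[offset + 1] & 0xFF) << 16)
                 | ((payload[offset + 2] & 0xFF) << 8)
                 | (payload[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encode(0.8f);
        System.out.println(payload.length + " bytes, decoded: " + decode(payload, 0));
    }
}
```

Because the width is fixed, the decoder only needs the offset, which is
why the length argument can be ignored in scorePayload above.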

You'll need to register that similarity in your Solr schema.xml.  This was
hard to figure out: I didn't realize that the similarity has to be applied
globally to the writer/searcher, even though I only care about payloads on
one field, so I wasted time trying to figure out how to plug in the
similarity in my query parser.

You'll want to use the "payloads" field type from the example schema.xml, or
something based on it.
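
As I understand it, that field type splits on whitespace and then uses the
delimited payload filter to separate each token from its payload, so a
field value looks like "12|0.8 7|0.35".  The sketch below imitates that
split in plain Java to show what ends up attached to each term.  The class
name and sample ids are hypothetical, and the "|" delimiter is an
assumption about the filter's default:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Imitates a whitespace tokenizer followed by a delimited float payload filter. */
final class DelimitedPayloadSketch {

    /** Maps each token to its float payload, or to null when it has no payload. */
    static Map<String, Float> parse(String fieldValue, char delimiter) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String token : fieldValue.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields no tokens
            }
            int split = token.indexOf(delimiter);
            if (split < 0) {
                result.put(token, null); // plain token without a payload
            } else {
                // Text before the delimiter is the term; text after it is the payload.
                result.put(token.substring(0, split),
                           Float.parseFloat(token.substring(split + 1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("12|0.8 7|0.35 31|0.6", '|'));
    }
}
```

With this layout the category id is the indexed term and its fit score
travels with it, so nothing has to be looked up in stored fields at query
time.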

The latest and greatest query type to use is PayloadTermQuery.  I use it in
my custom query parser class, overriding getFieldQuery, checking for my
field name, and then:

    return new PayloadTermQuery(new Term(field, queryText),
            new AveragePayloadFunction());

Due to the global nature of the Similarity, I guess you'd have to modify it
to look at the field name and base behavior on that if you wanted different
kinds of payloads on different fields in one schema.
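
A rough sketch of that field-name check, written as a stand-alone method
rather than a real Similarity subclass so that it runs without Lucene.
The field name "categories" and the class name are hypothetical, and
returning 1.0 as the neutral value for other fields is an assumption:

```java
import java.nio.ByteBuffer;

/** Stand-in for a field-aware scorePayload; not a real Lucene Similarity. */
final class FieldAwarePayloadScore {

    /** Hypothetical name of the only field whose payloads are float scores. */
    static final String PAYLOAD_FIELD = "categories";

    static float scorePayload(String fieldName, byte[] payload, int offset, int length) {
        // Any other field, or a missing or short payload, gets a neutral factor.
        if (!PAYLOAD_FIELD.equals(fieldName) || payload == null || length < 4) {
            return 1.0f;
        }
        // ByteBuffer reads big-endian by default (assumed to match the encoder).
        return Float.intBitsToFloat(ByteBuffer.wrap(payload, offset, 4).getInt());
    }

    public static void main(String[] args) {
        byte[] payload = ByteBuffer.allocate(4).putFloat(0.8f).array();
        System.out.println(scorePayload("categories", payload, 0, 4));
        System.out.println(scorePayload("title", payload, 0, 4));
    }
}
```

The same branch could dispatch to a different decoder per field if several
fields carried different kinds of payloads.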

Also, whereas in my original implementation, I controlled the score
completely, and therefore if I set a score of 0.8, the doc came back as
score of 0.8, in this technique the payload is just used as a boost/addition
to the score, so my scores came out higher than before.  Since they're still
in the same relative order, that still satisfied my needs, but did require
updating my test cases.
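
A small worked example of why the order survives.  This is only a model:
it assumes the averaged payload multiplies the ordinary term score, which
is my understanding of the default behaviour and not something verified
here.  The base score of 2.5 and all names are made up for illustration:

```java
/** Toy model of payload-boosted scoring; the numbers and names are illustrative only. */
final class PayloadBoostModel {

    /** Averages the payload scores seen for one term in one document. */
    static float average(float[] payloads) {
        float sum = 0.0f;
        for (float p : payloads) {
            sum += p;
        }
        return payloads.length == 0 ? 1.0f : sum / payloads.length;
    }

    /** Assumed combination: ordinary term score times the averaged payload. */
    static float combined(float baseScore, float[] payloads) {
        return baseScore * average(payloads);
    }

    public static void main(String[] args) {
        float base = 2.5f; // hypothetical term score, equal for both documents
        float strongFit = combined(base, new float[] {0.8f});
        float weakFit = combined(base, new float[] {0.4f});
        // The absolute values differ from the raw payloads, but the order is unchanged.
        System.out.println(strongFit > weakFit);
    }
}
```

Since each category id occurs at most once per document, the average is
just that single payload; the final score is no longer equal to it, only
ordered by it when the other scoring factors are the same.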

-- 
Stephen Duncan Jr
www.stephenduncanjr.com

Re: Customizing Field Score (Multivalued Field)

Posted by Stephen Duncan Jr <st...@gmail.com>.
On Thu, Nov 12, 2009 at 2:54 PM, Chris Hostetter <ho...@fucit.org> wrote:

>
> oh man, so you were parsing the Stored field values of every matching doc
> at query time? ouch.
>
> Assuming i'm understanding your goal, the conventional way to solve this
> type of problem is "payloads" ... you'll find lots of discussion on it in
> the various Lucene mailing lists, and if you look online Michael Busch has
> various slides that talk about using them.  they let you say things
> like "in this document, at this position of field 'x' the word 'microsoft'
> is worth 37.4, but at this other position (or in this other document)
> 'microsoft' is only worth 17.2"
>
> The simplest way to use them in Solr (as i understand it) is to use
> something like the DelimitedPayloadTokenFilterFactory when indexing, and
> then write yourself a simple little custom QParser that generates a
> BoostingTermQuery on your field.
>
> should be a lot simpler to implement than the Query you are describing,
> and much faster.
>
>
> -Hoss
>
>
Thanks. I finally got around to looking at this again today and was looking
at a similar path, so I appreciate the confirmation.

-- 
Stephen Duncan Jr
www.stephenduncanjr.com

Re: Customizing Field Score (Multivalued Field)

Posted by Chris Hostetter <ho...@fucit.org>.
: Here's how we did it in Lucene: we had an extension of Query, with a custom
: scorer.  In the index we stored the category ids as a single-valued,
: space-separated string.  We also stored a space-separated string of scores
: in another field.  We made both of these fields stored.  We simply delegated
: the search to the normal searcher; then, when we calculated the score, we
: retrieved the values of both fields for the document.  We turned the
: space-separated strings into arrays, searched the id array for the index of
: the desired id, looked up the score at that same index in the score array,
: and returned it.

oh man, so you were parsing the Stored field values of every matching doc 
at query time? ouch.

Assuming i'm understanding your goal, the conventional way to solve this 
type of problem is "payloads" ... you'll find lots of discussion on it in 
the various Lucene mailing lists, and if you look online Michael Busch has 
various slides that talk about using them.  they let you say things 
like "in this document, at this position of field 'x' the word 'microsoft'
is worth 37.4, but at this other position (or in this other document) 
'microsoft' is only worth 17.2"

The simplest way to use them in Solr (as i understand it) is to use
something like the DelimitedPayloadTokenFilterFactory when indexing, and
then write yourself a simple little custom QParser that generates a
BoostingTermQuery on your field.

should be a lot simpler to implement than the Query you are describing,
and much faster.


-Hoss