Posted to solr-user@lucene.apache.org by Amit Nithian <an...@gmail.com> on 2013/11/13 00:46:04 UTC

Boosting documents by categorical preferences

Hi all,

I have a question around boosting. I wanted to use the &boost= to write a
nested query that will boost a document based on categorical preferences.

For a movie search for example, say that a user likes drama, comedy, and
action. I could use things like

qq=&q={!boost%20b=$b%20defType=edismax%20v=$qq}&b=sum(product(query($cat1),1.482),product(query($cat2),0.1199),product(query($cat3),1.448))&cat1=category:Drama&cat2=category:Comedy&cat3=category:Action

where cat1=Drama cat2=Comedy cat3=Action

Currently I have the weights set to the z-score equivalent of a user's
preference for that category which is simply how many standard deviations
above the global average is this user's preference for that movie category.
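
A minimal sketch of the z-score weight described above, assuming the
per-category global average and standard deviation are already known. The
function name and the sample numbers are hypothetical and are not taken from
the actual setup.

```python
def category_z_score(user_avg, global_avg, global_std):
    """Number of standard deviations by which the user's average rating for
    a category sits above (positive) or below (negative) the global average."""
    if global_std <= 0:
        raise ValueError("global_std must be positive")
    return (user_avg - global_avg) / global_std


# Hypothetical numbers: the user averages 4.5 for Drama, while the global
# Drama average is 3.5 with a standard deviation of 1.0.
drama_weight = category_z_score(4.5, 3.5, 1.0)
print(drama_weight)  # 1.0
```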

My question though is basically whether or not semantically the equation
query(category:Drama)*<some weight> + query(category:Comedy)*<some weight>
+ query(category:Action)*<some weight> makes sense?

What are some techniques people use to boost documents based on discrete
things like category, manufacturer, genre etc?

Thanks!
Amit

Re: Boosting documents by categorical preferences

Posted by Amit Nithian <an...@gmail.com>.

Chris,

Sounds good! Thanks for the tips. I'll be glad to submit my talk to this,
as I have a writeup pretty much ready to go.

Cheers
Amit


On Tue, Jan 28, 2014 at 11:24 AM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : The initial results seem to be kinda promising... of course there are many
> : more optimizations I could do like decay user ratings over time to indicate
> : that preferences decay over time so a 5 rating a year ago doesn't count as
> : much as a 5 rating today.
> :
> : Hope this helps others. I'll open source what I have soon and post back. If
> : there is feedback or other thoughts let me know!
>
> Hey Amit,
>
> Glad to hear your user based boosting experiments are paying off.  I would
> definitely love to see a more detailed writeup down the road showing off
> how it affects your final user metrics -- or perhaps even give a session
> on your technique at ApacheCon?
>
>
> http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp
>
>
> -Hoss
> http://www.lucidworks.com/
>

Re: Boosting documents by categorical preferences

Posted by Chris Hostetter <ho...@fucit.org>.

: The initial results seem to be kinda promising... of course there are many
: more optimizations I could do like decay user ratings over time to indicate
: that preferences decay over time so a 5 rating a year ago doesn't count as
: much as a 5 rating today.
: 
: Hope this helps others. I'll open source what I have soon and post back. If
: there is feedback or other thoughts let me know!

Hey Amit,

Glad to hear your user based boosting experiments are paying off.  I would 
definitely love to see a more detailed writeup down the road showing off 
how it affects your final user metrics -- or perhaps even give a session 
on your technique at ApacheCon?

http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp


-Hoss
http://www.lucidworks.com/

Re: Boosting documents by categorical preferences

Posted by Amit Nithian <an...@gmail.com>.

Hi Chris (and others interested in this),

Sorry for dropping off. I got sidetracked with other work, then came back to
this and finally got a V1 implemented.

The final process is as follows:
1) Pre-compute the global per-category number of ratings, average, and
standard deviation (so for Action the average rating may be 3.49 with a
standard deviation of 0.99).
2) For a given user, retrieve the last X ratings (X is 10 for me) and
compute the user's categorical affinities: take the user's average rating
for all movies in a particular category (e.g. Action), subtract the global
category average, and divide by the category's standard deviation. Then
multiply this by the fraction of the user's ratings that fall in that
category.
   -> For example, if a user's last 10 ratings consisted of 9/10 Drama and
1/10 Thriller, the z-score for Thriller should be discounted relative to
that for Drama, so that the user's preference (either positive or negative)
for Drama is more prominent.
3) Sort by the absolute value of the z-score (thanks Hossman, great
thought).
4) Return the top 3 (an arbitrary number).
5) Modify the query to look like the following:

qq=tom hanks&q={!boost b=$b defType=edismax
v=$qq}&cat1=category:Children&cat2=category:Fantasy&cat3=category:Animation&b=sum(1,sum(product(query($cat1),0.22267872),product(query($cat2),0.21630952),product(query($cat3),0.21120241)))

basically b = 1+(pref1*query(category:something1) +
pref2*query(category:something2) + pref3*query(category:something3))
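
The steps above could be sketched roughly as follows. This is an
illustration under stated assumptions, not the actual implementation: the
function names, the input shapes, and the four-decimal formatting are
hypothetical, and each rating is assumed to carry exactly one category.

```python
from collections import defaultdict


def category_affinities(recent_ratings, global_stats):
    """recent_ratings: list of (category, rating) for the user's last X ratings.
    global_stats: {category: (global_avg, global_std)}, pre-computed (step 1).
    Returns {category: z-score scaled by that category's share of the ratings}."""
    by_category = defaultdict(list)
    for category, rating in recent_ratings:
        by_category[category].append(rating)
    total = len(recent_ratings)
    affinities = {}
    for category, ratings in by_category.items():
        global_avg, global_std = global_stats[category]
        z = (sum(ratings) / len(ratings) - global_avg) / global_std
        # Discount categories that make up only a small share of the ratings.
        affinities[category] = z * len(ratings) / total
    return affinities


def top_categories(affinities, n=3):
    """Steps 3 and 4: sort by absolute value and keep the n strongest."""
    ranked = sorted(affinities.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n]


def boost_params(user_query, top):
    """Step 5: build request parameters in the shape of the query above."""
    if not top:
        raise ValueError("need at least one category preference")
    params = {"qq": user_query, "q": "{!boost b=$b defType=edismax v=$qq}"}
    terms = []
    for i, (category, weight) in enumerate(top, start=1):
        params[f"cat{i}"] = f"category:{category}"
        terms.append(f"product(query($cat{i}),{weight:.4f})")
    params["b"] = "sum(1,sum(" + ",".join(terms) + "))"
    return params


# Hypothetical data: 9 Drama ratings of 5 and one Thriller rating of 1.
stats = {"Drama": (3.5, 1.0), "Thriller": (3.0, 1.0)}
ratings = [("Drama", 5.0)] * 9 + [("Thriller", 1.0)]
params = boost_params("tom hanks", top_categories(category_affinities(ratings, stats)))
```

With this sample data the Thriller z-score of -2.0 is scaled down to -0.2
because only one of the ten ratings is a Thriller, while Drama keeps most of
its weight.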

The initial results seem to be kinda promising... of course there are many
more optimizations I could do like decay user ratings over time to indicate
that preferences decay over time so a 5 rating a year ago doesn't count as
much as a 5 rating today.
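
One way the rating decay mentioned above might look, as a sketch only.
Exponential decay and the 180-day half-life are assumptions chosen for
illustration; no decay scheme has been settled on here, and the function
name is hypothetical.

```python
def decayed_average(ratings, half_life_days=180.0):
    """ratings: list of (rating, age_in_days).
    Each rating is weighted by 0.5 ** (age / half_life), so a rating that is
    one half-life old counts half as much as a rating made today."""
    if not ratings:
        raise ValueError("need at least one rating")
    weights = [0.5 ** (age / half_life_days) for _, age in ratings]
    weighted_sum = sum(rating * w for (rating, _), w in zip(ratings, weights))
    return weighted_sum / sum(weights)


# A 5 rated today and a 1 rated 180 days ago: the older rating has half the weight.
recent_heavy = decayed_average([(5.0, 0.0), (1.0, 180.0)])
```

The decayed average could then replace the plain per-category average before
the z-score is computed.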

Hope this helps others. I'll open source what I have soon and post back. If
there is feedback or other thoughts let me know!

Cheers
Amit


On Fri, Nov 22, 2013 at 11:38 AM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : I thought about that but my concern/question was how. If I used the pow
> : function then I'm still boosting the bad categories by a small
> : amount..alternatively I could multiply by a negative number but does that
> : work as expected?
>
> I'm not sure I understand your concern: negative powers would give you
> values less than 1, positive powers would give you values greater than 1,
> and then you'd use those values as multiplicative boosts -- so the values
> less than 1 would penalize the scores of existing matching docs in the
> categories the user dislikes.
>
> Oh wait ... I see, in your original email (and in my subsequent suggested
> tweak to use pow()) you were talking about sum()ing up these 3 category
> boosts (and I cut/pasted sum() in my example as well) ... yeah,
> using multiplication there would make more sense if you wanted to do the
> "negative preferences" as well, because then the score of any matching doc
> will be reduced if it matches on an "undesired" category -- and the
> amount it will be reduced will be determined by how strongly it
> matches on that category (i.e., the base score returned by the nested
> query() func) and "how negative" the undesired preference value (i.e.,
> the pow() exponent) is.
>
>
> qq=...
> q={!boost b=$b v=$qq}
>
> b=product(pow(query($cat1),$cat1z),pow(query($cat2),$cat2z),pow(query($cat3),$cat3z))
> cat1=...action...
> cat1z=1.48
> cat2=...comedy...
> cat2z=1.33
> cat3=...kids...
> cat3z=-1.7
>
>
> -Hoss
>

Re: Boosting documents by categorical preferences

Posted by Chris Hostetter <ho...@fucit.org>.

: I thought about that but my concern/question was how. If I used the pow
: function then I'm still boosting the bad categories by a small
: amount..alternatively I could multiply by a negative number but does that
: work as expected?

I'm not sure I understand your concern: negative powers would give you
values less than 1, positive powers would give you values greater than 1,
and then you'd use those values as multiplicative boosts -- so the values
less than 1 would penalize the scores of existing matching docs in the
categories the user dislikes.

Oh wait ... I see, in your original email (and in my subsequent suggested
tweak to use pow()) you were talking about sum()ing up these 3 category
boosts (and I cut/pasted sum() in my example as well) ... yeah,
using multiplication there would make more sense if you wanted to do the
"negative preferences" as well, because then the score of any matching doc
will be reduced if it matches on an "undesired" category -- and the
amount it will be reduced will be determined by how strongly it
matches on that category (i.e., the base score returned by the nested
query() func) and "how negative" the undesired preference value (i.e.,
the pow() exponent) is.


qq=...
q={!boost b=$b v=$qq}
b=product(pow(query($cat1),$cat1z),pow(query($cat2),$cat2z),pow(query($cat3),$cat3z))
cat1=...action...
cat1z=1.48
cat2=...comedy...
cat2z=1.33
cat3=...kids...
cat3z=-1.7
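
A rough numeric sketch of the multiplicative form above, in plain Python
rather than Solr. The function name and the sample scores are hypothetical.
It also assumes that a category the document does not match contributes a
neutral score of 1.0; in Solr that would presumably require passing a
default as the second argument of query(), which the example above does not
do.

```python
import math


def multiplicative_boost(category_scores, exponents, default=1.0):
    """category_scores: {category: score of the nested category query} for
    the categories this document matches.
    exponents: {category: z-score used as the pow() exponent}.
    Categories the document does not match fall back to `default`."""
    boost = 1.0
    for category, z in exponents.items():
        score = category_scores.get(category, default)
        boost *= math.pow(score, z)
    return boost


exponents = {"action": 1.48, "comedy": 1.33, "kids": -1.7}
liked = multiplicative_boost({"action": 2.0}, exponents)    # greater than 1
disliked = multiplicative_boost({"kids": 2.0}, exponents)   # between 0 and 1
neutral = multiplicative_boost({}, exponents)                # exactly 1.0
```

The penalty only points the right way when the category score is above 1; a
score between 0 and 1 raised to a negative power comes out greater than 1.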


-Hoss

Re: Boosting documents by categorical preferences

Posted by Amit Nithian <an...@gmail.com>.

I thought about that but my concern/question was how. If I used the pow
function then I'm still boosting the bad categories by a small
amount..alternatively I could multiply by a negative number but does that
work as expected?

I haven't done much with negative boosting except for the sledgehammer
approach of category exclusion through filters.

Thanks
Amit

On Nov 19, 2013 8:51 AM, "Chris Hostetter" <ho...@fucit.org> wrote:

> : My approach was something like:
> : 1) Look at the categories that the user has preferred and compute the
> : z-score
> : 2) Pick the top 3 among those
> : 3) Use those to boost search results.
>
> I think that totally makes sense ... the additional bit I was suggesting
> that you consider is that instead of picking the "highest" 3 z-scores,
> pick the z-scores with the greatest absolute value ... that way if someone
> is a very boring person and their "positive interests" are all basically
> exactly the same as the mean for everyone else, but they have some very
> strong "dis-interests" you don't bother boosting on those minuscule
> interests and instead you negatively boost on the things they are
> antagonistic toward.
>
>
> -Hoss
>

Re: Boosting documents by categorical preferences

Posted by Chris Hostetter <ho...@fucit.org>.

: My approach was something like:
: 1) Look at the categories that the user has preferred and compute the
: z-score
: 2) Pick the top 3 among those
: 3) Use those to boost search results.

I think that totally makes sense ... the additional bit I was suggesting
that you consider is that instead of picking the "highest" 3 z-scores,
pick the z-scores with the greatest absolute value ... that way if someone
is a very boring person and their "positive interests" are all basically
exactly the same as the mean for everyone else, but they have some very
strong "dis-interests" you don't bother boosting on those minuscule
interests and instead you negatively boost on the things they are
antagonistic toward.
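
To make the difference concrete, here is a small sketch comparing the two
selection rules. The function names and the z-scores are made up for
illustration.

```python
def highest_z_scores(z_scores, n=3):
    """Pick the n categories with the largest z-scores."""
    return sorted(z_scores, key=lambda cat: z_scores[cat], reverse=True)[:n]


def strongest_z_scores(z_scores, n=3):
    """Pick the n categories whose z-scores have the greatest absolute value."""
    return sorted(z_scores, key=lambda cat: abs(z_scores[cat]), reverse=True)[:n]


# A user whose likes are close to average but whose dislikes are strong.
z_scores = {"Drama": 0.05, "Comedy": 0.02, "Action": 0.01,
            "Horror": -2.1, "Kids": -1.7}

print(highest_z_scores(z_scores))    # ['Drama', 'Comedy', 'Action']
print(strongest_z_scores(z_scores))  # ['Horror', 'Kids', 'Drama']
```

The first rule spends all three boosts on near-zero preferences, while the
second surfaces the two strong dislikes.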


-Hoss

Re: Boosting documents by categorical preferences

Posted by Amit Nithian <an...@gmail.com>.

Hey Chris,

Sorry for the delay and thanks for your response. This was inspired by your
talk on boosting and biasing that you presented way back when at a meetup.
I'm glad that my general approach seems to make sense.

My approach was something like:
1) Look at the categories that the user has preferred and compute the
z-score
2) Pick the top 3 among those
3) Use those to boost search results.

I'll look at using the boosts as exponents instead of multipliers, as I
think that would make sense, and it also handles the 0 case.

This is for a prototype I am doing but I'll share the results one day in a
meetup as I think it'll be kinda interesting.

Thanks again
Amit


On Thu, Nov 14, 2013 at 11:11 AM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : I have a question around boosting. I wanted to use the &boost= to write a
> : nested query that will boost a document based on categorical preferences.
>
> You have no idea how stoked I am to see you working on this in a real
> world application.
>
> : Currently I have the weights set to the z-score equivalent of a user's
> : preference for that category which is simply how many standard deviations
> : above the global average is this user's preference for that movie category.
> :
> : My question though is basically whether or not semantically the equation
> : query(category:Drama)*<some weight> + query(category:Comedy)*<some weight>
> : + query(category:Action)*<some weight> makes sense?
>
> My gut says that your approach makes sense -- but if I'm
> understanding you correctly, I think that you need to add "1" to
> all your weights: the "boost" is a multiplier, so if someone's rating for
> every category is 0 std devs above the average rating (i.e., the most
> average person imaginable), you don't want to give every movie in every
> category a score of 0.
>
> Are you picking the "top 3" categories the user prefers as a cut off, or
> are you arbitrarily using N category boosts for however many N categories
> the user is above the global average in their pref for that category?
>
> Are your preferences coming from explicit user feedback on the categories
> (e.g., "rate how much you like comedies on a scale of 1-5") or are you
> inferring them from user ratings of the movies themselves (e.g., "rate this
> movie, which happens to be a scifi,action,comedy, on a scale of 1-5")? ...
> because if it's the latter you probably want to be careful to also
> normalize based on how many categories the movie is in.
>
> The other thing to consider is whether you want to include "negative
> preferences" (i.e., weights less than 1) based on how many std devs the
> user's average is *below* the global average for a category ... in this
> case I *think* you'd want to divide -1 by the raw value to get a useful
> multiplier.
>
> Alternatively: you could experiment with using the weights as exponents
> instead of multipliers...
>
>
> b=sum(pow(query($cat1),1.482),pow(query($cat2),0.1199),pow(query($cat3),1.448))
>
> ...that would simplify the math you'd have to worry about both for the
> "totally boring average user" (x**0 = 1) and for the categories users hate
> (x**-5 = some positive fraction that will act as a penalty) ... but you'd
> definitely need to run some tests to see if it "over boosts" as the std
> dev variations get really high (might want to take a root first before
> using them as the exponent)
>
>
>
> -Hoss
>

Re: Boosting documents by categorical preferences

Posted by Chris Hostetter <ho...@fucit.org>.

: I have a question around boosting. I wanted to use the &boost= to write a
: nested query that will boost a document based on categorical preferences.

You have no idea how stoked I am to see you working on this in a real 
world application.

: Currently I have the weights set to the z-score equivalent of a user's
: preference for that category which is simply how many standard deviations
: above the global average is this user's preference for that movie category.
: 
: My question though is basically whether or not semantically the equation
: query(category:Drama)*<some weight> + query(category:Comedy)*<some weight>
: + query(category:Action)*<some weight> makes sense?

My gut says that your approach makes sense -- but if I'm
understanding you correctly, I think that you need to add "1" to
all your weights: the "boost" is a multiplier, so if someone's rating for
every category is 0 std devs above the average rating (i.e., the most
average person imaginable), you don't want to give every movie in every
category a score of 0.
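
A small numeric illustration of why the offset matters, assuming the final
score is the base relevance score multiplied by the boost. The function
names and numbers are hypothetical, and the sketch applies the 1 as an
offset on the whole sum, i.e. the b = 1 + (...) form that appears elsewhere
in this thread, rather than adding 1 to each individual weight.

```python
def additive_boost(weights, match_scores, offset=1.0):
    """weights: {category: z-score weight}.
    match_scores: {category: score of the nested category query}; a category
    the document does not match counts as 0."""
    weighted = sum(w * match_scores.get(cat, 0.0) for cat, w in weights.items())
    return offset + weighted


def final_score(base_score, boost):
    """The boost acts as a multiplier on the base relevance score."""
    return base_score * boost


# The "most average person imaginable": every weight is 0.
average_user = {"Drama": 0.0, "Comedy": 0.0, "Action": 0.0}
doc_scores = {"Drama": 2.0}

without_offset = final_score(3.0, additive_boost(average_user, doc_scores, offset=0.0))
with_offset = final_score(3.0, additive_boost(average_user, doc_scores))
print(without_offset, with_offset)  # 0.0 3.0
```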

Are you picking the "top 3" categories the user prefers as a cut off, or 
are you arbitrarily using N category boosts for however many N categories 
the user is above the global average in their pref for that category?

Are your preferences coming from explicit user feedback on the categories
(e.g., "rate how much you like comedies on a scale of 1-5") or are you
inferring them from user ratings of the movies themselves (e.g., "rate this
movie, which happens to be a scifi,action,comedy, on a scale of 1-5")? ...
because if it's the latter you probably want to be careful to also
normalize based on how many categories the movie is in.

The other thing to consider is whether you want to include "negative
preferences" (i.e., weights less than 1) based on how many std devs the
user's average is *below* the global average for a category ... in this
case I *think* you'd want to divide -1 by the raw value to get a useful
multiplier.

Alternatively: you could experiment with using the weights as exponents
instead of multipliers...

b=sum(pow(query($cat1),1.482),pow(query($cat2),0.1199),pow(query($cat3),1.448))

...that would simplify the math you'd have to worry about both for the
"totally boring average user" (x**0 = 1) and for the categories users hate
(x**-5 = some positive fraction that will act as a penalty) ... but you'd
definitely need to run some tests to see if it "over boosts" as the std
dev variations get really high (might want to take a root first before
using them as the exponent)
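
A quick sketch of the exponent idea, including one possible way to "take a
root first". The signed square root used here is an assumption (a plain
square root is undefined for negative z-scores), and the function name and
sample values are hypothetical.

```python
import math


def exponent_boost(score, z, damp=True):
    """Raise the category score to the z-score.
    With damp=True the exponent is the signed square root of z, which keeps
    the sign of the preference but shrinks large magnitudes."""
    exponent = math.copysign(math.sqrt(abs(z)), z) if damp else z
    return math.pow(score, exponent)


print(exponent_boost(2.0, 0.0))               # 1.0     (x**0 = 1, the average user)
print(exponent_boost(2.0, -5.0, damp=False))  # 0.03125 (x**-5, a penalty)
print(exponent_boost(2.0, 4.0, damp=False))   # 16.0    (undamped, may over-boost)
print(exponent_boost(2.0, 4.0))               # 4.0     (exponent damped to 2)
```

The penalty reading of x**-5 holds for scores above 1; whether the nested
query scores stay in that range would need to be checked against real data.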



-Hoss