Posted to user@mahout.apache.org by Dave Williford <da...@gmail.com> on 2010/07/22 00:54:25 UTC

Adding weighting to boolean data

We are currently using LogLikelihoodSimilarity to create item
recommendations based on page visits on our web site.  We would like to
influence the generated recommendations with factors such as age of visit
(weigh more recent visits more heavily), duration of page view (longer is
better), and same-visit vs. cross-visit (things looked at on the same day
are more related than items looked at by a given user across visits).

I am considering introducing scores for each user/page data point.  This
would essentially replace the integer calculations (which are based on
summing total data points for each item, total items, and the intersection
of item A with item B) with real numbers.  We could always round the sums
to integers before sending them through the log-likelihood calculation,
although I am not sure this is necessary.

Note that these scores are not conceptually the same as preferences, so I
don't think switching to a preference-based algorithm would give
satisfactory results.

I am very new to all of this and am wondering if I am completely off base
or if this seems like a valid approach.  Any input is much appreciated.
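
For what it's worth, the G^2 (log-likelihood ratio) formula itself does not
require integer counts, so rounding should not be necessary.  Here is a rough
sketch of the usual entropy-based 2x2 computation evaluated on real-valued
totals (the class and method names are just for illustration, not an existing
Mahout API):

// Sketch: log-likelihood ratio over a 2x2 table of (possibly fractional) counts.
// k11 = weighted co-occurrences of item A and item B
// k12 = weighted occurrences of A without B
// k21 = weighted occurrences of B without A
// k22 = weighted occurrences of everything else
public final class WeightedLlr {

  public static double logLikelihoodRatio(double k11, double k12,
                                          double k21, double k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Guard against tiny negative results from floating-point rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }

  // Un-normalized entropy: N * H(p), built from x*log(x) terms.
  private static double entropy(double... counts) {
    double sum = 0.0;
    double xLogXSum = 0.0;
    for (double x : counts) {
      sum += x;
      xLogXSum += xLogX(x);
    }
    return xLogX(sum) - xLogXSum;
  }

  private static double xLogX(double x) {
    return x <= 0.0 ? 0.0 : x * Math.log(x);
  }
}

The four cells would then be sums of your per-event scores rather than raw
event counts.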

Re: Adding weighting to boolean data

Posted by Ted Dunning <te...@gmail.com>.
As a matter of practice, the effect of highly structured opportunities does
indeed create a problem.  This is especially bad when the historical record
is filled with rule-based results, because those tend to have no information
outside the well-worn tracks.

My own approach for dealing with this is to add some noise to recommendation
lists.  Usually this noise takes the form of dithering the result order so
that results deep down in the result list occasionally make it to the first
page.  Additional exploration can be driven by having a "recently added"
page, a "most popular" page, or ontology-based navigational pages.  Since
the effect of log data quickly saturates (even more so when you down-sample
excessively popular items), these extra sources do not have to be a large
fraction of your data volume in order to have a distinctly positive effect.
Similarly, it is helpful to introduce alternative recommendations in an A/B
test fashion so that you get a diversity of data collection.

Regarding exclusion of data, I only advocate excluding data that doesn't
help.  If it helps give a better user experience, then including it is
warranted.  Likewise with sampling or down-weighting certain sources:
unless the volume is massive and redundant, I wouldn't worry all that much
about it.
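
For concreteness, a minimal dithering sketch: re-rank by log(rank) plus
Gaussian noise, then re-sort (the noise scale and the class name here are
illustrative choices, not from any particular library):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch: re-rank a result list so that results deep in the list
// occasionally bubble up toward the top.  Each position gets a noisy sort
// key of log(rank) + Gaussian noise; small noise mostly preserves the
// order, larger noise mixes more of the tail into the first page.
public final class Dithering {

  private final Random random = new Random();
  private final double noiseScale;   // tuning knob, e.g. 0.3 .. 1.0

  public Dithering(double noiseScale) {
    this.noiseScale = noiseScale;
  }

  public <T> List<T> dither(List<T> ranked) {
    final double[] keys = new double[ranked.size()];
    List<Integer> order = new ArrayList<Integer>();
    for (int i = 0; i < ranked.size(); i++) {
      keys[i] = Math.log(i + 1.0) + noiseScale * random.nextGaussian();
      order.add(i);
    }
    Collections.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return Double.compare(keys[a], keys[b]);
      }
    });
    List<T> result = new ArrayList<T>(ranked.size());
    for (int i : order) {
      result.add(ranked.get(i));
    }
    return result;
  }
}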

Re: Adding weighting to boolean data

Posted by Ted Dunning <te...@gmail.com>.
Dave,

Ultimately, if you want to use all of this side information that you have,
you will need to go to a more nuanced technique.  This recent paper gives a
very exciting approach to the problem.  It is exciting because, aside from
the training algorithm, it seems that it could fit fairly well into the
current Mahout/Taste framework.

Dyadic Prediction Using a Latent Feature Log-Linear Model
http://arxiv4.library.cornell.edu/abs/1006.2156
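
Very roughly, models in this family make the label probability log-linear in
latent features of the two sides of the dyad plus whatever side information
is attached to the dyad, something like this (a generic form for orientation
only, not necessarily the paper's exact parameterization):

  P(y | u, i) = \frac{\exp\big(\alpha_u^{y\top} \beta_i^{y} + w_y^{\top} x_{ui}\big)}
                     {\sum_{y'} \exp\big(\alpha_u^{y'\top} \beta_i^{y'} + w_{y'}^{\top} x_{ui}\big)}

where \alpha_u^{y} and \beta_i^{y} are latent feature vectors for user u and
item i, and x_{ui} carries dyad-level side information such as recency,
dwell time, or a same-visit flag.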

Re: Adding weighting to boolean data

Posted by Dave Williford <da...@alumni.utexas.net>.
One other factor I wanted to mention is what type of link was used to get to
a page.  Here I am referring to either a recommended link (generated by this
engine) or an "organic" link (everything else).  I am concerned that if we
do not discount the impact of recommended links, we will be "stuffing the
ballot box" and creating a positive feedback loop that will drown out
organic traffic.  I do not want to completely ignore recommended links,
because that would be saying that if a visitor uses a recommended link, they
would not have found the page organically (which obviously isn't the case,
or we would never have made the recommendation in the first place).  I am
not sure how to tune this to make it a neutral factor.

With all of these factors, it sounds like you are recommending that instead
of adjusting the weight of individual data points, we should just
include/exclude whole data points.  So instead of saying recommended links
count as 0.7, we would only include 70% of them for a given item and keep
the boolean scoring?
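
If it helps to see the include/exclude idea concretely, a tiny sampling
sketch (0.7 is just the number from the example above; the class and field
names are made up for illustration):

import java.util.Random;

// Sketch: keep every organic page view, but keep only a fraction of the
// views that arrived through a recommended link, so that recommendations
// do not fully feed back into their own training data.
public final class RecommendedLinkSampler {

  private final Random random = new Random();
  private final double keepProbability;   // e.g. 0.7

  public RecommendedLinkSampler(double keepProbability) {
    this.keepProbability = keepProbability;
  }

  // 'cameFromRecommendation' would come from your click logging.
  public boolean include(boolean cameFromRecommendation) {
    if (!cameFromRecommendation) {
      return true;                          // organic views always count
    }
    return random.nextDouble() < keepProbability;
  }
}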


Re: Adding weighting to boolean data

Posted by Dave Williford <da...@alumni.utexas.net>.
Due to the nature of our data collection to date, we effectively have the
all-or-nothing approach (over a certain duration it is in, otherwise it is
out).  This will be changing, so we would at least have the opportunity to
be more granular.  I agree that it would be very difficult to assign a
meaningful gradient of values to duration.  I like the binary nature of this
attribute.

The same-visit vs. cross-visit linkage weighting is something the business
clients are convinced we need, and we have currently implemented it by
running the engine twice, once for same-visit and once for cross-visit data,
then merging the results using a weighted average.  After looking at the
results, I am pretty sure this is not the right way to do this.  If we were
to stick with binary data, we should probably be using a sampling of the
cross-visit data points as a way to down-weight their impact.

I am intrigued by your approach of going back in a data set far enough to
get data for certain items.  I think we are currently using too much
history, which will cause our recommendations to be less responsive to
current traffic; however, we do have items that do not have a lot of
traffic, and we run the risk of having no recommendations for them if we cut
down on the overall data set.  With that said, this approach seems like it
would be very computationally intensive.  By fixing the count for the item
of interest and letting the total sample count vary, aren't you creating
varying data sets for each item you want to generate recommendations for?

Thanks for the feedback.  This is very helpful.
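
A sketch of what sampling cross-visit pairs could look like, so that a
single binary data set replaces the two separate runs (the rate and the
names are illustrative, not a recommendation):

import java.util.Random;

// Sketch: when building co-occurrence data, always keep pairs of pages seen
// in the same visit, and keep only a sample of pairs that are linked merely
// because the same user saw them in different visits.
public final class CrossVisitSampler {

  private final Random random = new Random();
  private final double crossVisitKeepRate;   // e.g. 0.3; a tuning knob

  public CrossVisitSampler(double crossVisitKeepRate) {
    this.crossVisitKeepRate = crossVisitKeepRate;
  }

  public boolean includePair(String visitIdA, String visitIdB) {
    if (visitIdA.equals(visitIdB)) {
      return true;                                     // same-visit: always keep
    }
    return random.nextDouble() < crossVisitKeepRate;   // cross-visit: sample
  }
}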



Re: Adding weighting to boolean data

Posted by Ted Dunning <te...@gmail.com>.
Another thought here is that in the past, my own designs have essentially
done this, but a bit more crudely.  We "decayed" old data in the sense that
we kept all the data we could eat in the time allowed for daily processing.
We "down-weighted" short visits by putting a threshold on the viewing time
and ignoring all views shorter than the threshold.  This heavy-hammer
approach is actually kind of hard to beat, largely because it is very hard
to find much gain in data that you already know you don't particularly care
for (which is why you are down-weighting it).  The exception to the
all-or-nothing approach was in the sampling of data for popular items.
There we looked back as far as necessary to get enough data for each item,
up to our limit.  That focused popular items on the recent past, but used
longer-term averages for the fringe or long-tail items.

These mechanisms worked pretty well for us, and if I were to do it over
again, I would definitely not spend the time implementing a fancy weighting
scheme without evidence that it would actually help things.  Even just
figuring out how to parametrize, measure and optimize the weighting scheme
is a big undertaking, which would make me even less likely to consider it as
an early design option.
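
For what it's worth, that per-item look-back can be a single pass over
events sorted newest-first rather than a separate data set per item;
roughly (the event layout and names are just for illustration):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: scan events from newest to oldest and keep at most 'limit' events
// per item.  Popular items fill their quota from the recent past; long-tail
// items naturally reach further back in history.
public final class PerItemLookback {

  // each event here is just {userId, itemId, timestamp} for illustration
  public static List<long[]> sample(List<long[]> eventsNewestFirst, int limit) {
    Map<Long, Integer> perItemCount = new HashMap<Long, Integer>();
    List<long[]> kept = new ArrayList<long[]>();
    for (long[] event : eventsNewestFirst) {
      long itemId = event[1];
      Integer seen = perItemCount.get(itemId);
      int count = seen == null ? 0 : seen;
      if (count < limit) {
        kept.add(event);
        perItemCount.put(itemId, count + 1);
      }
    }
    return kept;
  }
}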

Re: Adding weighting to boolean data

Posted by Ted Dunning <te...@gmail.com>.
This is, roughly, a reasonable thing to do.

If you want to maintain the fiction of counts a little bit more closely, you
might consider just having counts decay over time and having short visits
only give partial credit.
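
A sketch of what a decayed, partial-credit "count" per page view could look
like (the half-life and full-credit duration are arbitrary knobs chosen for
illustration, not recommendations):

// Sketch: turn one page view into a fractional count.  Recency decays the
// contribution exponentially; short views earn only partial credit.
public final class EventWeight {

  public static double weight(double ageInDays, double viewSeconds,
                              double halfLifeDays, double fullCreditSeconds) {
    double recency = Math.pow(0.5, ageInDays / halfLifeDays);       // 1.0 today, 0.5 after one half-life
    double credit = Math.min(1.0, viewSeconds / fullCreditSeconds); // caps at full credit
    return recency * credit;
  }
}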
