Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/09/08 14:44:31 UTC

Re: Taste-GenericItemBasedRecommender

I wanted the user base to take note of the change Gökhan suggests
below. I committed a variant of it just now which does indeed notably
speed up most algorithms by more intelligently selecting possibilities
to consider. On one test I am playing with, it sped things up by 50% --
in another, more like 400%. Depending on your data, this could be a big
win.

Sean

On Mon, Sep 7, 2009 at 2:03 PM, Gökhan Çapan<gk...@gmail.com> wrote:
> Hi, Sean.
> I think we talked about the mostSimilarItems() function before, about a bug
> in ItemBasedRecommender.
> I think there is another issue, about performance.
>
> The mostSimilarItems function gives the list of items most similar to a
> given item.
> In computing those items, the algorithm looks at all other items in the data
> model, but if no user has rated both items, there is no need to check for
> similarity between the active item and that item.
>
>
>
> That is the original function that returns the most-similar-items list in
> cf.taste.impl.recommender.GenericItemBasedRecommender:
>
>   private List<RecommendedItem> doMostSimilarItems(long itemID,
>                                                    int howMany,
>                                                    TopItems.Estimator<Long> estimator)
>       throws TasteException {
>     DataModel model = getDataModel();
>     FastIDSet allItemIDs = new FastIDSet(model.getNumItems());
>     LongPrimitiveIterator it = model.getItemIDs();
>     while (it.hasNext()) {
>       allItemIDs.add(it.nextLong());
>     }
>     allItemIDs.remove(itemID);
>     return TopItems.getTopItems(howMany, allItemIDs.iterator(), null, estimator);
>   }
>
>
>
>
> I updated it and use it this way:
>
>   private List<RecommendedItem> doMostSimilarItems(long itemID,
>                                                    int howMany,
>                                                    TopItems.Estimator<Long> estimator)
>       throws TasteException {
>     DataModel model = getDataModel();
>     // Only consider items rated by at least one user who also rated the given item
>     FastIDSet set = new FastIDSet();
>     PreferenceArray arr = model.getPreferencesForItem(itemID);
>     for (int i = 0; i < arr.length(); i++) {
>       set.addAll(model.getItemIDsFromUser(arr.get(i).getUserID()));
>     }
>     set.remove(itemID);
>     return TopItems.getTopItems(howMany, set.iterator(), null, estimator);
>   }
>
>
>
> The only difference between the two functions is:
> the original one passes all items to getTopItems;
> mine passes only the items that have at least one user who has rated both
> the active item and that item.
>
>
>
> This little change made the algorithm considerably faster
> (for my data set it runs 4 times faster now).
>
> I wanted to inform you, in case you want to try it and update the code.
> If for some other reason the original version of the code is better, please
> let me know.

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
And for the binary case, this is exactly correct.  By describing the data as
"is-rated", we have a binary case.

On Wed, Sep 9, 2009 at 11:09 PM, Gökhan Çapan <gk...@gmail.com> wrote:

> >
> > Yes, more like it. Actually there's no "if rating is not zero" logic.
> > "0" is a real, valid rating. The test would be "if X and Y have been
> > rated".
> >
> > > 3. After the change:
> >
> > Same comment -- I don't think 'non-zero' is the right test.
> >
> I actually wanted to say the same thing.  Non-zero means "is rated" in my post.




-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Gökhan Çapan <gk...@gmail.com>.
>
> Yes, more like it. Actually there's no "if rating is not zero" logic.
> "0" is a real, valid rating. The test would be "if X and Y have been
> rated".
>
> > 3. After the change:
>
> Same comment -- I don't think 'non-zero' is the right test.
>
I actually wanted to say the same thing.  Non-zero means "is rated" in my post.

-- 
Gökhan Çapan

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
Yes.  And this last operation is only separate because it so often needs
access to row sums and overall totals.  It could be inserted into the first
MR step at the cost of clarity.

On Wed, Sep 9, 2009 at 8:48 PM, Sean Owen <sr...@gmail.com> wrote:

> Yes -- that second mapreduce is the real question! Is there such a
> parallelizable black box that transforms the matrix multiply result
> into recommendations? I'm guessing there is indeed a matrix that makes
> the result into useful similarity weights, then this can be finished
> off by multiplying by the user rating vector.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
Great discussion. My take-aways are that the current implementation is
more or less the same as a matrix-based implementation. It's been
pretty specialized and probably works as fast or faster than a simple,
clean matrix implementation. But we need a matrix-based implementation
since that is distributable and the current implementation can't
really be distributed. So it won't really work past a scale of maybe
100M ratings. And that implementation is likely to be relatively
efficient, even given the mapreduce overhead, and can incorporate
familiar concepts like log-likelihood similarity, etc.

I'd like to put this on my to-do list, for after two things happen:
1. Hadoop gets sorted out. Right now I can't really make progress on
Hadoop 0.20.0, period.
2. Our matrix implementation is finalized -- I understand we're probably
switching to some other library?

On Thu, Sep 10, 2009 at 8:39 AM, Gökhan Çapan<gk...@gmail.com> wrote:
> So, these algorithms are nearly the same, in terms of pattern of computation,
> and this speed-up is not enough, right?
>
> --
> Gökhan Çapan
>

Re: Taste-GenericItemBasedRecommender

Posted by Gökhan Çapan <gk...@gmail.com>.
So, these algorithms are nearly the same, in terms of pattern of computation,
and this speed-up is not enough, right?

-- 
Gökhan Çapan

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
I don't think it is doing more than necessary.  And it is surprisingly fast.

Matrix-like computations have lots of surprises.  Even on a single machine,
rearranging the loops can have a massive effect on speed.

And, yes, the product A'A is also sparse, although less sparse than A.
Checking for anomalously large values and only keeping those can cause the
filtered A'A to be much more sparse than A.

On Wed, Sep 9, 2009 at 8:48 PM, Sean Owen <sr...@gmail.com> wrote:

> Now that works... my gut reaction is this is doing way more number
> crunching than the existing implementation does. Just producing A'A is
> going to do so much work. Yes I realize A is sparse but is the
> product?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
Let's get all loosey goosey in terms of nomenclature.

Suppose that there is a general pattern of matrix multiply that I will call
mm.  This general pattern is a function that takes a combiner function and
an aggregation function and it gives you back the function of two matrices
that you want.

A reasonable definition of this function in a made up functional language
is:

   mm(combiner, aggregator) = lambda(a, b) {
         for (i, j) {
               r[i,j] = aggregator(map(combiner, { (a[i,k], b[k,j]) | k }))
          }
          return r
    }

Normal linear algebra uses a product that looks like mm(*, +).

Shortest path, all pairs, uses repeated applications of mm(+, min).

We might use honest-to-god matrix multiply for recs.  Or we might use a
version that introduces weights.  Or one which sparsifies the final result.  The
important point is the pattern of computation is the same.
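
To make the pattern concrete, a minimal Java sketch (the Combiner and
Aggregator interfaces here are invented for illustration, not Mahout APIs):

    interface Combiner { double combine(double x, double y); }
    interface Aggregator { double aggregate(double[] values); }

    // Generalized "multiply": r[i][j] = aggregate over k of combine(a[i][k], b[k][j])
    static double[][] mm(double[][] a, double[][] b, Combiner c, Aggregator agg) {
      int n = a.length;
      int m = b[0].length;
      int inner = b.length;
      double[][] r = new double[n][m];
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
          double[] combined = new double[inner];
          for (int k = 0; k < inner; k++) {
            combined[k] = c.combine(a[i][k], b[k][j]);
          }
          r[i][j] = agg.aggregate(combined);
        }
      }
      return r;
    }

Ordinary matrix multiply is mm with combine(x, y) = x * y and summation as the
aggregator; the shortest-path step is combine(x, y) = x + y with min as the
aggregator.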

On Wed, Sep 9, 2009 at 8:48 PM, Sean Owen <sr...@gmail.com> wrote:

> Anyway, like I said, it doesn't seem we're really doing a dot product
> to get the result matrix, but some other computation. But is it then a
> 'matrix multiplication'?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
Yes I think we're all thinking on the right lines but I think the
narration needs some tweaking:


On Wed, Sep 9, 2009 at 11:42 PM, Gökhan Çapan <gk...@gmail.com> wrote:
> Then (A'A)ij is a similarity weight between the ith and jth items.
> If Aij is the "rating of the ith user for the jth item", the highest value
> in the "ith row of A'A" gives the most similar item to the "ith item".
> If the values in A are binary, then (A'A)ij is the number of users who have
> rated/clicked/viewed both item i and item j.

Yes, well as Ted says, they aren't quite usable as weights in that
form. There is some additional work to do:


> 1. basic matrix approach:
>
>    weight:=0;
>    for each user u
>        if   u rated both X and Y weight+= rating(u,X)*rating(u,Y);

One conceptual question I had is -- actually, in a matrix-based
approach, unrated items 'appear' as actual 0 ratings, which doesn't
work. So again there's a second reason the basic work isn't 'just' a
dot product. So is this really the computation?


> 2. Taste's implementation was:
>
>    weight:=0;
>    for each user u
>        if rating(u,X) != 0 and rating(u,Y) != 0
>            updateWeightWithChosenSimilarityMetric();

Yes, more like it. Actually there's no "if rating is not zero" logic.
"0" is a real, valid rating. The test would be "if X and Y have been
rated".


> 3. After the change:

Same comment -- I don't think 'non-zero' is the right test.

You are describing how to compute a single item-item similarity. Yes,
the code was already taking advantage of sparsity as you say.

The change you proposed was at one level higher in the item-based
recommender computation. The change was to not compute this for all
items in the model, but only for candidates that could possibly be
recommended. So there's one loop outside this that would have been
like 'for each item i that the user has not rated', which is now a more
complex rule that results in significantly fewer items.


Anyway, like I said, it doesn't seem we're really doing a dot product
to get the result matrix, but some other computation. But is it then a
'matrix multiplication'?

The transformations Ted describes are computing the similarity metrics
found in ItemSimilarity / UserSimilarity implementations... so the
point I am getting at is, I wonder if what the code does now is
actually equivalent to this pseudo-matrix-multiplication? I haven't
thought it through, but I would hope and imagine that it's just an
uglier (but I would hope faster, because special-purposed) version of
the same computation.

That would be good in that it would mean the current implementation
isn't 'missing something'. But Ted's last message is the real point. A
true matrix multiply can be parallelized. And true parallelization
would be a useful addition to the set of tools we provide.

Yes -- that second mapreduce is the real question! Is there such a
parallelizable black box that transforms the matrix multiply result
into recommendations? I'm guessing there is indeed a matrix that makes
the result into useful similarity weights, then this can be finished
off by multiplying by the user rating vector.


Now that works... my gut reaction is this is doing way more number
crunching than the existing implementation does. Just producing A'A is
going to do so much work. Yes I realize A is sparse but is the
product? At scale that's waaay more number crunching. Is that the
price of parallelizing? Because indeed I don't see a way to
parallelize other than to produce a parallelized version of the simple,
if literal and heavy, matrix operations you describe.

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
Yes, if I understand your meaning. Both steps will be computed in
Hadoop. The second step involves loading one row of A'A at a time, and
dotting it with the user vector to get a recommendation score for one
item.
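
A minimal sketch of that second step (plain Java; loadRow here is a
hypothetical helper that side-loads one row of A'A from HDFS):

    // score[i] = dot(row i of A'A, h), holding one row in memory at a time
    static double[] scoreAllItems(int numItems, double[] h) {
      double[] scores = new double[numItems];
      for (int i = 0; i < numItems; i++) {
        double[] row = loadRow(i);  // hypothetical: reads row i of A'A
        double dot = 0.0;
        for (int j = 0; j < numItems; j++) {
          dot += row[j] * h[j];
        }
        scores[i] = dot;
      }
      return scores;
    }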

On Fri, Dec 4, 2009 at 9:31 AM, Gökhan Çapan <gk...@gmail.com> wrote:
> Ok, so we will implement an item-based recommender:
> - mapred jobs for constructing A'A first.
> - the online part of the algorithm ((A'A)h).
>
> What do you think about my comments on the online step? Using a small subset
> of A'A

Re: Taste-GenericItemBasedRecommender

Posted by Gökhan Çapan <gk...@gmail.com>.
Ok, so we will implement an item-based recommender:
- mapred jobs for constructing A'A first.
- the online part of the algorithm ((A'A)h).

What do you think about my comments on the online step? Using a small subset
of A'A

On Fri, Dec 4, 2009 at 11:19 AM, Sean Owen <sr...@gmail.com> wrote:

> Sure, I'm implementing this today, just a simple version of what's
> been discussed. What would we call this? It's basically item-based
> recommendation.
>
> On Fri, Dec 4, 2009 at 8:57 AM, Gökhan Çapan <gk...@gmail.com> wrote:
> > ith row of the matrix A'A contains all items and their similarity degrees
> to
> > the item that is represented at ith column of the matrix A.
> > I guess it is enough using only a subset of A'A at the final step, that
> is,
> > the rows which represent the items that are in active user's history.
> > btw, I also want to contribute to that implementation, if we can decide
> the
> > algorithm.
>



-- 
Gökhan Çapan

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
Sure, I'm implementing this today, just a simple version of what's
been discussed. What would we call this? It's basically item-based
recommendation.

On Fri, Dec 4, 2009 at 8:57 AM, Gökhan Çapan <gk...@gmail.com> wrote:
> ith row of the matrix A'A contains all items and their similarity degrees to
> the item that is represented at ith column of the matrix A.
> I guess it is enough using only a subset of A'A at the final step, that is,
> the rows which represent the items that are in active user's history.
> btw, I also want to contribute to that implementation, if we can decide the
> algorithm.

Re: Taste-GenericItemBasedRecommender

Posted by Gökhan Çapan <gk...@gmail.com>.
Yeah, you are right. I also think it must be the online part of the
algorithm.

On Fri, Dec 4, 2009 at 9:45 PM, Ted Dunning <te...@gmail.com> wrote:

> You are correct that only a little bit of A'A is used for any given user's
> recommendation, but it is the columns that correspond to their history, not
> the rows.  Getting this wrong doesn't matter much as long as A'A is still
> symmetrical, but getting it right can make your code less confusing for
> others, and if you limit the number of non-zero elements in each row of A'A,
> then it becomes important to keep track of which is which.
>
> On Fri, Dec 4, 2009 at 12:57 AM, Gökhan Çapan <gk...@gmail.com> wrote:
>
> > ith row of the matrix A'A contains all items and their similarity degrees
> > to
> > the item that is represented at ith column of the matrix A.
> > I guess it is enough using only a subset of A'A at the final step, that
> is,
> > the rows which represent the items that are in active user's history.
> > btw, I also want to contribute to that implementation, if we can decide
> the
> > algorithm.
> >
> > --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Gökhan Çapan

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
You are correct that only a little bit of A'A is used for any given user's
recommendation, but it is the columns that correspond to their history, not
the rows.  Getting this wrong doesn't matter much as long as A'A is still
symmetrical, but getting it right can make your code less confusing for
others, and if you limit the number of non-zero elements in each row of A'A,
then it becomes important to keep track of which is which.

On Fri, Dec 4, 2009 at 12:57 AM, Gökhan Çapan <gk...@gmail.com> wrote:

> ith row of the matrix A'A contains all items and their similarity degrees
> to
> the item that is represented at ith column of the matrix A.
> I guess it is enough using only a subset of A'A at the final step, that is,
> the rows which represent the items that are in active user's history.
> btw, I also want to contribute to that implementation, if we can decide the
> algorithm.
>
> --
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Gökhan Çapan <gk...@gmail.com>.
ith row of the matrix A'A contains all items and their similarity degrees to
the item that is represented at ith column of the matrix A.
I guess it is enough using only a subset of A'A at the final step, that is,
the rows which represent the items that are in active user's history.
btw, I also want to contribute to that implementation, if we can decide the
algorithm.

On Fri, Dec 4, 2009 at 10:33 AM, Sean Owen <sr...@gmail.com> wrote:

> Yes, this makes sense. I do need two passes. One pass converts input
> from "user,item,rating" triples into user vectors. Then the second
> step builds the co-occurrence A'A product. I agree that it will be
> faster to take a shortcut than properly compute A'A.
>
> (Though I'm curious how this works -- looks deceptively easy, this
> outer product approach. Isn't v cross v potentially huge? or likely to
> be sparse enough to not matter)
>
> I understand the final step in principle, which is to compute (A'A)h.
> But I keep guessing A'A is too big to fit in memory? So I can
> side-load the rows of A'A one at a time and compute it rather
> manually.
>
>
> On Thu, Dec 3, 2009 at 8:28 PM, Ted Dunning <te...@gmail.com> wrote:
> > I think you can merge my passes into a single pass in which you compute
> the
> > row and column sums at the same time that you compute the product.  That
> is
> > more complicated, though, and I hate fancy code.  So you are right in
> > practice that I have always had two passes.  (although pig might be
> clever
> > enough by now to merge them)
> >
> > There is another pass in which you use all of the sums to do the
> > sparsification.  I don't know if that could be done in the same pass or
> not.
>



-- 
Gökhan Çapan

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
This product needs to happen at recommendation time.

And it is usually better to think of it in transposed form as the sum of a
few columns selected by the non-zero elements of h.  That way you retrieve
just a few columns.

On Fri, Dec 4, 2009 at 10:59 AM, Sean Owen <sr...@gmail.com> wrote:

> On Fri, Dec 4, 2009 at 6:04 PM, Jake Mannix <ja...@gmail.com> wrote:
> > How would you do this?  Take the rows of v_i cross v_i and add them
> > up?  Isn't that another MR job?
>
> No I suppose I meant take each row of A'A, one at a time, and dot with
> h. That's computing the recommendation vector with only one row in
> memory at a time.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Dec 4, 2009 at 6:04 PM, Jake Mannix <ja...@gmail.com> wrote:
> How would you do this?  Take the rows of v_i cross v_i and add them
> up?  Isn't that another MR job?

No I suppose I meant take each row of A'A, one at a time, and dot with
h. That's computing the recommendation vector with only one row in
memory at a time.

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
If you down-sample during conversion from triples then you never even need
to keep an entire row in memory.  After down-sampling it won't matter.
Moreover, in the actual multiplication, you only pass around individual
summand elements of (A'A) so there is little memory required there as well.
Most of the excess space used by emitting all those single elements is removed
by the combiner.  The reducer will also remove elements through
sparsification.  The result is less sparse than the original data, but is
still very sparse.

In the final form of (A'A) it may even be desirable to limit the number of
non-zero elements in a row (which breaks symmetry without much harm).  You
generally need another MR step to do this, but the effect on final
recommendation run-time cost is significant and the effect on recommendation
quality is nil.  This has the effect (in the Lucene form of the recommender)
of limiting the number of terms in each recommendation document.

On Fri, Dec 4, 2009 at 10:04 AM, Jake Mannix <ja...@gmail.com> wrote:

>  sum_i  (v_i cross v_i)
>
> is indeed going to get pretty dense, so the reducer may totally blow
> up if care is not taken - because it's all being done in memory in the
> reducer (ie yes, all of A'A lives in memory just before the reducer
> completes, in my naive impl).
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Dec 4, 2009 at 12:33 AM, Sean Owen <sr...@gmail.com> wrote:

>
> (Though I'm curious how this works -- looks deceptively easy, this
> outer product approach. Isn't v cross v potentially huge? or likely to
> be sparse enough to not matter)
>

v cross v is very huge, but very sparse... *but*

  sum_i  (v_i cross v_i)

is indeed going to get pretty dense, so the reducer may totally blow
up if care is not taken - because it's all being done in memory in the
reducer (ie yes, all of A'A lives in memory just before the reducer
completes, in my naive impl).
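
For concreteness, a minimal sketch of that naive accumulation (dense output
array for clarity; a real reducer would presumably use sparse vectors until
density forces otherwise):

    // A'A = sum over users u of (v_u cross v_u), v_u being u's item vector
    static double[][] accumulateAtA(double[][] userVectors, int numItems) {
      double[][] ata = new double[numItems][numItems];
      for (double[] v : userVectors) {
        for (int i = 0; i < numItems; i++) {
          if (v[i] == 0.0) { continue; }  // each summand touches only non-zero items
          for (int j = 0; j < numItems; j++) {
            if (v[j] != 0.0) {
              ata[i][j] += v[i] * v[j];
            }
          }
        }
      }
      return ata;
    }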


> I understand the final step in principle, which is to compute (A'A)h.
> But I keep guessing A'A is too big to fit in memory? So I can
> side-load the rows of A'A one at a time and compute it rather
> manually.
>

How would you do this?  Take the rows of v_i cross v_i and add them
up?  Isn't that another MR job?

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Dec 5, 2009 at 5:17 PM, Sean Owen <sr...@gmail.com> wrote:

> I suggest for purposes of the project we would build implementations
> of Recommender that can consume some output from Hadoop on HDFS, like
> a SequenceFile or whatever it's called. Shouldn't be hard at all. This
> sort of hybrid approach is already what happens with slope-one -- I
> wrote some jobs to build its diffs and then you can load the output
> into SlopeOneRecommender -- which works online from there.
>

Something generic like this would be helpful, I think, as well as outputting
to a Lucene index.

> I wonder how important these implementations are for the project,
> which seems like a bit of heresy -- surely Mahout needs to support
> recommendation on huge amounts of data? I think the answer's yes, but:
>
> LinkedIn and Netflix and Apple and most organizations with huge data
> to recommend from have already developed sophisticated, customized
> solutions.
>

Actually, from direct experience and conversations with principals involved,
I can tell you that you would be surprised at the unsophistication of some
parts of the production systems at all three of these places (as well as,
e.g., Amazon).

Mahout could end up becoming large parts of some of their infrastructure
for doing this at some point.

Organizations with less than 100M data points or so to process don't
> need distributed architectures. They can use Mahout as-is with its
> online non-distributed recommenders pretty well. 10 lines of code and
> one big server and a day of tinkering and they have a full-on simple
> recommender engine, online or offline. And I argue that this is about
> 90% of users of the project who want recommendations.
>

Today, yes.  In a year, this number will maybe be 80%.  In 2 years - maybe
60%.  Big data is coming to smaller and smaller organizations.
Usage data doesn't need to be only internal: stuff you mine off of the web
can be used too...


> So who are these organizations that have enough data (like 1B+ data
> points) that they need something like the rocket science that LinkedIn
> needs, but can't or haven't developed such capability already
> in-house?
>
> I guess that's why I've been reluctant to engineer and complicate the
> framework to fit in offline distributed recommendation -- because this
> can become as complex as we like -- since I wonder at the 'market' for
> it. But it seems inevitable that this must exist, even if just as a
> nice clean simple reference implementation of the idea. Perhaps I
> won't go overboard on designing something complex yet here at the
> moment.
>

All of the above aside: I agree that over-engineering
something to do this is not desirable.  But thinking about how we
output partially processed "matrices" for on-line recommendation
generation is something we should still do.  Maybe we're adequately
served by spitting out SequenceFiles, with a simple API for zipping
through them and producing scores using pluggable scoring functions?

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
The computation of cooccurrence matrices is a more general task than
just recommendations, and is of more interest to me than just
recommendations.  Cooccurrence counting is a key step in building a random
index, for instance.

I will also second Jake's comment about how even large scale organizations
could well use an efficient off-line/on-line recommendation system.  Few of
them have had the luxury of building the system they really wanted
(certainly that was true at Veoh).  By pooling resources, we can build
something of use to all of us.

On Sat, Dec 5, 2009 at 5:17 PM, Sean Owen <sr...@gmail.com> wrote:

> I guess that's why I've been reluctant to engineer and complicate the
> framework to fit in offline distributed recommendation -- because this
> can become as complex as we like -- since I wonder at the 'market' for
> it. But it seems inevitable that this must exist, even if just as a
> nice clean simple reference implementation of the idea. Perhaps I
> won't go overboard on designing something complex yet here at the
> moment.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
I suggest for purposes of the project we would build implementations
of Recommender that can consume some output from Hadoop on HDFS, like
a SequenceFile or whatever it's called. Shouldn't be hard at all. This
sort of hybrid approach is already what happens with slope-one -- I
wrote some jobs to build its diffs and then you can load the output
into SlopeOneRecommender -- which works online from there.

At least then the "hybrid" offline/online recommenders aren't yet a
third species of recommender in the framework. Perhaps there isn't
even a need for fully offline recommenders? Just jobs that can produce
supporting intermediate output for online recommenders? That'd be
tidier still.


If I may digress --

I wonder how important these implementations are for the project,
which seems like a bit of heresy -- surely Mahout needs to support
recommendation on huge amounts of data? I think the answer's yes, but:

LinkedIn and Netflix and Apple and most organizations with huge data
to recommend from have already developed sophisticated, customized
solutions.

Organizations with less than 100M data points or so to process don't
need distributed architectures. They can use Mahout as-is with its
online non-distributed recommenders pretty well. 10 lines of code and
one big server and a day of tinkering and they have a full-on simple
recommender engine, online or offline. And I argue that this is about
90% of users of the project who want recommendations.

So who are these organizations that have enough data (like 1B+ data
points) that they need something like the rocket science that LinkedIn
needs, but can't or haven't developed such capability already
in-house?

I guess that's why I've been reluctant to engineer and complicate the
framework to fit in offline distributed recommendation -- because this
can become as complex as we like -- since I wonder at the 'market' for
it. But it seems inevitable that this must exist, even if just as a
nice clean simple reference implementation of the idea. Perhaps I
won't go overboard on designing something complex yet here at the
moment.


On Sun, Dec 6, 2009 at 12:43 AM, Jake Mannix <ja...@gmail.com> wrote:
> But having a nice API for *outputting* the precomputed matrices, which
> are pretty big, into a format where online "queries"/recommendation
> requests can be computed is, I think, really key here.  We should think
> much more about what makes the most sense here.

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Dec 5, 2009 at 4:27 PM, Sean Owen <sr...@gmail.com> wrote:

> The biggest pain for me here is how to rationalize all of this into an
> API. The current code is completely online. Now I'm dropping in a
> truly offline/distributed version, which is a totally different
> ballgame. And then there are all these hybrid approaches, computing
> some stuff offline and some online and requiring real-time integration
> with HDFS.


This is something I've been thinking about a lot too - at LinkedIn we do a
ton of offline Hadoop-based computation for recommendations, but then a
bunch of stuff is/can be done online.  You can do it with Lucene, as Ted
suggests (and in fact that is one of our implementations), or by doing a
lot of precomputation of results and storing them in a key-value store
(in our case, Voldemort).

But having a nice API for *outputting* the precomputed matrices, which
are pretty big, into a format where online "queries"/recommendation
requests can be computed is, I think, really key here.  We should think
much more about what makes the most sense here.

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
On Sat, Dec 5, 2009 at 10:27 PM, Ted Dunning <te...@gmail.com> wrote:
> Post your code and I will add the down-sampling.

Yep, I understand the point. The code is already in SVN under
org.apache.mahout.cf.taste.hadoop.item. It's still in progress --
doesn't work at the moment, but you should see all the major parts.


> Also, I typically discard counts from the user vector because that improves
> resistance to spammers.  If you do that, then the counts you are producing
> will all be 1 (cooccurrence for a single user).  That means that the
> combiner and reducer will still be required.

Yes, I don't even output the 1, just the item -> item mapping. It's
the simplest place to start.


> I may be misunderstanding what you are saying.  You may be saying that you
> will be accumulating all the cooccurrence data for all users in memory.
> That seems like a bad idea given that the combiner/reducer approach is
> pretty darned fast.

No, it's never all in memory, but all cooccurrence for one item would be
in memory. The reducer would get the item -> item mappings -- really,
an item -> iterator-over-items. And then from that you construct a
row of the cooccurrence matrix. I think it's leveraging the combiner
just fine, as this is what's collecting all the cooccurrence for one
item into one place.
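
A rough sketch of that reducer (Hadoop 0.20 API; the key/value types and the
text output format are placeholders, not the actual
org.apache.mahout.cf.taste.hadoop.item classes):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Counts the co-occurring item IDs for one item into one row
    // of the co-occurrence matrix.
    public class CooccurrenceRowReducer
        extends Reducer<LongWritable, LongWritable, LongWritable, Text> {
      @Override
      protected void reduce(LongWritable itemID, Iterable<LongWritable> cooccurring,
                            Context ctx) throws IOException, InterruptedException {
        Map<Long, Integer> row = new HashMap<Long, Integer>();
        for (LongWritable other : cooccurring) {
          Integer count = row.get(other.get());
          row.put(other.get(), count == null ? 1 : count + 1);
        }
        for (Map.Entry<Long, Integer> e : row.entrySet()) {
          ctx.write(itemID, new Text(e.getKey() + ":" + e.getValue()));
        }
      }
    }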


> The first approach requires access to all rows of M, but does the dot
> products using the sparse pattern of v.  The second approach inverts the
> nesting of the loops and moves the use of sparsity of v to the outer loop.
> This allows us to access only a small number of columns of M which is a huge
> win even if A is entirely in memory.  Secondarily, it also effectively lifts
> the highly repeated determination of the sparsity pattern of v out of the
> inner loop resulting in additional savings.

Yeah I get it, that does make a good bit of sense. That is a big win.


> I say this because user histories change in real-time and it is common to
> require real-time updates to recommendations.  Since the recommendation is
> so very cheap to compute it seems reasonable to do it at real time rather
> than pay the cost of storing all recommendations for all users until they
> need them.  Moreover, since the recommendation is more dense than either the
> history or the columns of (A'A) and because many common columns of A'A will
> be cached in memory, the I/O cost isn't even all that different.  That means
> that total cost of just retrieving a recommendation is comparable to the
> cost of computing it on the fly.

Yes I get this too, and makes sense as well.

The biggest pain for me here is how to rationalize all of this into an
API. The current code is completely online. Now I'm dropping in a
truly offline/distributed version, which is a totally different
ballgame. And then there are all these hybrid approaches, computing
some stuff offline and some online and requiring real-time integration
with HDFS. I'm trying to incorporate it all in a way that isn't chaos,
since I'm trying to write a chapter on this now as well. Will have to
come up with a theory and a plan to implement it.

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
On Sun, Dec 13, 2009 at 3:24 AM, Jake Mannix <ja...@gmail.com> wrote:
>  You do the co-occurrence matrix (for item-by-item, right?) on Hadoop too,
> and that part is really fast, but computing the recommendations is very
> slow?  By what orders of magnitude, for the whole set?
>
> What are the scales you are testing with, in terms of total number of users,
> items, and ratings?

Yes, for about 10M ratings (tens of thousands of users and items) the
co-occurrence matrix counts take a couple minutes, and then recs are
on track to take a day or two.

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Dec 12, 2009 at 3:16 PM, Sean Owen <sr...@gmail.com> wrote:


> Recommendations are computed for one user at a time, by multiplying
> the co-occurrence matrix by the user preference vector. And then yes
> it's one big job invoking computation for all users.
>
> I'm running this all on one machine (my laptop) so it's kind of
> serialized anyway. Yes, it was 10 seconds to compute all recs for one
> user; it's a couple secs now with some more work. That's still rough
> but not awful.

So when doing a big batch of a thousand users, say, you're saying it's
taking your laptop 3 hours to do this using the Hadoop-based code (in
pseudo-distributed mode)?


> All of it is on Hadoop here. It's pretty simple -- make the user
> vectors, make the co-occurrence matrix (all that is quite fast), then
> multiply the two to make recommendations.

You do the co-occurrence matrix (for item-by-item, right?) on Hadoop too,
and that part is really fast, but computing the recommendations is very
slow?  By what orders of magnitude, for the whole set?

What are the scales you are testing with, in terms of total number of users,
items, and ratings?

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
On Sat, Dec 12, 2009 at 11:08 PM, Jake Mannix <ja...@gmail.com> wrote:
> You're not computing only one recommendation at a time, are you?
> I really need to read through the hadoop.item code, but in general, what
> is the procedure here?  If you're doing work on HDFS as a M/R job, you're
> doing a huge batch, right?  You're saying the aggregate performance is
> 10 seconds per recommendation across millions of recommendations, or
> doing a one-shot task?

Recommendations are computed for one user at a time, by multiplying
the co-occurrence matrix by the user preference vector. And then yes
it's one big job invoking computation for all users.

I'm running this all on one machine (my laptop) so it's kind of
serialized anyway. Yes, it was 10 seconds to compute all recs for one
user; it's a couple secs now with some more work. That's still rough
but not awful.


> offline).  Can you give a quick review of which part of this is supposed
> to be on Hadoop, which parts are done live, a kind of big picture
> description of what's going on?

All of it is on Hadoop here. It's pretty simple -- make the user
vectors, make the co-occurrence matrix (all that is quite fast), then
multiply the two to make recommendations.

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Dec 12, 2009 at 8:58 AM, Sean Owen <sr...@gmail.com> wrote:

> I've implemented this but it's still quite slow. Computing
> recommendations goes from a couple hundred ms to 10 seconds. Nothing
> wrong with this idea -- it's all the vector loading and distributed
> stuff that's weighing it down.
>

You're not computing only one recommendation at a time, are you?
I really need to read through the hadoop.item code, but in general, what
is the procedure here?  If you're doing work on HDFS as a M/R job, you're
doing a huge batch, right?  You're saying the aggregate performance is
10 seconds per recommendation across millions of recommendations, or
doing a one-shot task?

I feel like too much of this conversation went by and I missed some
crucial piece describing the task in a big-picture sense (and this
notion is backed up by the fact that we keep talking past each other
when it comes to which parts of this process are online and which are
offline).  Can you give a quick review of which part of this is supposed
to be on Hadoop, which parts are done live, a kind of big picture
description of what's going on?

> I think that's the culprit in fact, having to load all the column
> vectors, since they're not light.
>
> One approach is to make the user vectors more sparse by throwing out
> data, though I don't like it so much.
>
> One question -- in SparseVector, can't we internally remove entries
> when they are set to 0.0? since implicitly missing entries are 0?
>

We should certainly add a "compact" method to both versions of
SparseVector, which could be called periodically to remove any
zeroes and save on subsequent computational costs.
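
A minimal sketch of such a (hypothetical) compact(), assuming a
parallel-arrays sparse representation:

    // Hypothetical: drop explicit zero entries from a sparse vector in place
    class ParallelArraysSparseVector {
      private int[] indices;
      private double[] values;
      private int numEntries;

      void compact() {
        int kept = 0;
        for (int i = 0; i < numEntries; i++) {
          if (values[i] != 0.0) {
            indices[kept] = indices[i];
            values[kept] = values[i];
            kept++;
          }
        }
        numEntries = kept;  // slots beyond 'kept' are now unused
      }
    }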

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
(redirected from mahout-user)

On Sun, Dec 13, 2009 at 1:25 PM, Jake Mannix <ja...@gmail.com> wrote:

> This is trickier with assign(Vector v, BinaryFunction f), because you have
> no way of testing if f.apply(x, 0) == x for all x, and here a marker
> interface might be necessary.  I'd also be fine with just checking that the
> function is one of Plus or PlusWithScale, because that's a ginormous chunk
> of the use case.
>

In fact, my own use case just turned out not to be a use case because I had
several things to do and several vectors to do it to.  That meant that it
was better for me to do my own iterateNonZero.

The number of functions that have a zero unit and are not addition or scaled
addition probably is quite small (averaged over real uses).  I think you are
right that we don't need additional machinery at this time.


-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
Duh.  Simplest approach wins!

On Sun, Dec 13, 2009 at 1:25 PM, Jake Mannix <ja...@gmail.com> wrote:

> > Following up on Jake's comment, could we have a marker interface that
> > indicates a function maps 0 to 0?  The abstract vector and matrix
> > implementations could use instanceOf to decide what to do.
> >
>
> I tried this when I was still trying to contribute to commons-math, but
> realized after several patch iterations that it's actually far easier to
> just do this:
>
>    Iterator<Element> e = (f.apply(0) == 0) ? iterateNonZero() : iterateAll();
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Dec 13, 2009 at 11:02 AM, Ted Dunning <te...@gmail.com> wrote:

> The issue is when adding a sparse vector to a dense vector (with the dense
> vector being mutated) that the dense vector doesn't know how to use the
> sparse iterator.
>

What do you mean? AbstractVector.plus already knows this in trunk - it
always calls iterateNonZero() on the other vector, as well it should, for
this operation.  The problem with plus() is that it makes a copy, and so
this is bad for doing accumulation.

For accumulation, you want to do assign(Vector v, BinaryFunction f), and
this needs to be done intelligently.


> Following up on Jake's comment, could we have a marker interface that
> indicates a function maps 0 to 0?  The abstract vector and matrix
> implementations could use instanceOf to decide what to do.
>

I tried this when I was still trying to contribute to commons-math, but
realized after several patch iterations that it's actually far easier to
just do this:

    Iterator<Element> e = (f.apply(0) == 0) ? iterateNonZero() : iterateAll();

Even if the method call is expensive, it's only done one extra time.

This is trickier with assign(Vector v, BinaryFunction f), because you have
no way of testing if f.apply(x, 0) == x for all x, and here a marker
interface might be necessary.  I'd also be fine with just checking that the
function is one of Plus or PlusWithScale, because that's a ginormous chunk
of the use case.

If we had these cases (or the marker interface which covered it in general),
then yes, assign(Vector v, BinaryFunction f) could be implemented by
checking which argument of f it was zero-preserving w.r.t., and then
choosing iterators appropriately.  Some of this logic could live in
subclasses (i.e. not AbstractVector) because, for example, SparseVectors
based on open-map don't iterate in order, but do have nice random access,
while IntDoublePair vectors have their own specialties.
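
A rough sketch of that dispatch, assuming an invented ZeroPreserving marker
(no such interface exists in trunk) on functions f where f.apply(x, 0) == x:

    // Hypothetical marker: f.apply(x, 0) == x for all x
    interface ZeroPreserving {}

    // Inside an AbstractVector-like class:
    public Vector assign(Vector other, BinaryFunction f) {
      if (f instanceof ZeroPreserving) {
        // zeros in 'other' leave this vector unchanged; visit only its non-zeros
        java.util.Iterator<Vector.Element> it = other.iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          setQuick(e.index(), f.apply(getQuick(e.index()), e.get()));
        }
      } else {
        for (int i = 0; i < size(); i++) {  // no shortcut: visit every element
          setQuick(i, f.apply(getQuick(i), other.getQuick(i)));
        }
      }
      return this;
    }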

But this is getting far afield - these are important optimizations, but I
have a feeling this isn't what is causing Sean's slowness.  There's
something else I haven't been able to put my finger on...

Maybe we should continue this on mahout-dev?  We're not in user-land
anymore. :)

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
The issue is when adding a sparse vector to a dense vector (with the dense
vector being mutated) that the dense vector doesn't know how to use the
sparse iterator.

Following up on Jake's comment, could we have a marker interface that
indicates a function maps 0 to 0?  The abstract vector and matrix
implementations could use instanceOf to decide what to do.

On Sun, Dec 13, 2009 at 5:12 AM, Sean Owen <sr...@gmail.com> wrote:

> In my particular case, I just need "Scale" instead of "PlusWithScale",
> and that can take advantage of sparseness.
>
> My (er, Ted's) current approach is to sum SparseVectors. This takes
> advantage of sparseness already.
>
> Am I missing why a Scale/PlusWithScale implementation, when using
> sparseness, would be notably faster?
>
>

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
My indices are all over the range 0..Integer.MAX_VALUE since they are
hashed from the actual long item IDs, so I imagine a DenseVector is a
non-starter.

Yeah would appreciate a second set of eyes on it. It's in a state
where it's worth looking at. At least I felt it was complete enough to
finish writing up in a chapter.

On Sun, Dec 13, 2009 at 5:58 PM, Jake Mannix <ja...@gmail.com> wrote:
> It looks like your results should probably be fairly dense in the end,
> right?  So you should be accumulating SparseVectors on a DenseVector output,
> at least...  but this should only be a small speedup, not a lower
> complexity... as you say, you're already taking advantage of sparseness in
> the sums...
>
> I need to dig in and try this myself tonight or tomorrow night...

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
It looks like your results should probably be fairly dense in the end,
right?  So you should be accumulating SparseVectors on a DenseVector output,
at least...  but this should only be a small speedup, not a lower
complexity... as you say, you're already taking advantage of sparseness in
the sums...

I need to dig in and try this myself tonight or tomorrow night...

On Dec 13, 2009 5:12 AM, "Sean Owen" <sr...@gmail.com> wrote:

> In my particular case, I just need "Scale" instead of "PlusWithScale",
> and that can take advantage of sparseness.
>
> My (er, Ted's) current approach is to sum SparseVectors. This takes
> advantage of sparseness already.
>
> Am I missing why a Scale/PlusWithScale implementation, when using
> sparseness, would be notably faster?

On Sun, Dec 13, 2009 at 11:11 AM, Jake Mannix <ja...@gmail.com> wrote:
> On Sat, Dec 12, 2009...

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
In my particular case, I just need "Scale" instead of "PlusWithScale",
and that can take advantage of sparseness.

My (er, Ted's) current approach is to sum SparseVectors. This takes
advantage of sparseness already.

Am I missing why a Scale/PlusWithScale implementation, when using
sparseness, would be notably faster?

On Sun, Dec 13, 2009 at 11:11 AM, Jake Mannix <ja...@gmail.com> wrote:
> On Sat, Dec 12, 2009 at 5:34 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> This is a key problem.
>>
>> Looks like we really need to think about versions of assign that only scan
>> non-zero elements.  Something like assignFromNonZero.
>>
>
> This is easier than the rest of the stuff we are talking about here.
> Whenever you have a UnaryFunction f such that f.apply(0) == 0, or you have
> a BinaryFunction b such that b.apply(x, 0) == x or b.apply(0, x) == x, then
> assign should take advantage of this and use iterateNonZero() on the vector
> whose zeros can be skipped over.  In Colt, they actually special-case
> functions like Plus or Minus to take care of this.
>
> In our case, in particular, PlusWithScale is a very, very common function
> for Vectors to apply, and could be special-cased itself, in AbstractVector.
>
>  -jake
>

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Dec 12, 2009 at 5:34 PM, Ted Dunning <te...@gmail.com> wrote:

> This is a key problem.
>
> Looks like we really need to think about versions of assign that only scan
> non-zero elements.  Something like assignFromNonZero.
>

This is easier than the rest of the stuff we are talking about here.
Whenever you have a UnaryFunction f such that f.apply(0) == 0, or you have
a BinaryFunction b such that b.apply(x, 0) == x or b.apply(0, x) == x, then
assign should take advantage of this and use iterateNonZero() on the vector
whose zeros can be skipped over.  In Colt, they actually special-case
functions like Plus or Minus to take care of this.

In our case, in particular, PlusWithScale is a very, very common function
for Vectors to apply, and could be special-cased itself, in AbstractVector.

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
This is a key problem.

Looks like we really need to think about versions of assign that only scan
non-zero elements.  Something like assignFromNonZero.

What is the sense of how that should look?

On Sat, Dec 12, 2009 at 2:54 PM, Sean Owen <sr...@gmail.com> wrote:

> Thanks, it makes sense, though in practice this didn't perform so
> well. assign() can't take advantage of sparseness.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
Thanks, it makes sense, though in practice this didn't perform so
well. assign() can't take advantage of sparseness.

After throwing in some caching and other improvements performance is,
well, bad in my book but not awful. It's perhaps the price of the
distributed nature of this computation -- or else I'm doing something
wrong.

I think it's working enough for my purposes at the moment.

On Sat, Dec 12, 2009 at 7:48 PM, Ted Dunning <te...@gmail.com> wrote:
> If the vector matrix product is done like this:
>
>      Vector w = v.like();
>      MultiplyAdd scale = new MultiplyAdd();
>      // Create the non-zero iterator once; calling v.iterateNonZero() each
>      // time through the loop would restart it and never terminate.
>      Iterator<Vector.Element> nonZeros = v.iterateNonZero();
>      while (nonZeros.hasNext()) {
>        Vector.Element element = nonZeros.next();
>        scale.setScale(element.get());
>        w.assign(getColumn(element.index()), scale);
>      }
>

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Dec 12, 2009 at 8:58 AM, Sean Owen <sr...@gmail.com> wrote:

> ...
> I think that's the culprit in fact, having to load all the column
> vectors, since they're not light.
>

If the vector matrix product is done like this:

      Vector w = v.like();
      MultiplyAdd scale = new MultiplyAdd();
      // Create the non-zero iterator once; calling v.iterateNonZero() each
      // time through the loop would restart it and never terminate.
      Iterator<Vector.Element> nonZeros = v.iterateNonZero();
      while (nonZeros.hasNext()) {
        Vector.Element element = nonZeros.next();
        scale.setScale(element.get());
        w.assign(getColumn(element.index()), scale);
      }

Then you might see better speed, especially since you can lazily load just
the columns you want.  Google Collections has an interesting dynamic map
builder that would be useful for this.
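
For instance, a sketch with the Google Collections computing-map builder
(loadColumn is a hypothetical loader for one column of A'A):

    import java.util.concurrent.ConcurrentMap;
    import com.google.common.base.Function;
    import com.google.common.collect.MapMaker;

    // Lazily load and cache columns of A'A on first access
    ConcurrentMap<Integer, Vector> columns = new MapMaker()
        .softValues()  // let the GC drop columns under memory pressure
        .makeComputingMap(new Function<Integer, Vector>() {
          public Vector apply(Integer index) {
            return loadColumn(index);  // hypothetical: reads one column from storage
          }
        });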


>
> One approach is to make the user vectors more sparse by throwing out
> data, though I don't like it so much.
>

This is often useful actually, but more in the sense of only retaining
recent events than sparsification.  If you get lots of data per user, then
this isn't much of a problem.  If you use rarer data, then you may have more
of an issue (ratings are the prime example).


>
> One question -- in SparseVector, can't we internally remove entries
> when they are set to 0.0? since implicitly missing entries are 0?
>
>
Absolutely.  Depending on representation, deletion of elements may not help
so much unless subsequent elements are added that fill in the holes.

-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
I've implemented this but it's still quite slow. Computing
recommendations goes from a couple hundred ms to 10 seconds. Nothing
wrong with this idea -- it's all the vector loading and distributed
stuff that's weighing it down.

I think that's the culprit in fact, having to load all the column
vectors, since they're not light.

One approach is to make the user vectors more sparse by throwing out
data, though I don't like it so much.

One question -- in SparseVector, can't we internally remove entries
when they are set to 0.0? since implicitly missing entries are 0?

On Sat, Dec 5, 2009 at 10:27 PM, Ted Dunning <te...@gmail.com> wrote:
> The first approach requires access to all rows of M, but does the dot
> products using the sparse pattern of v.  The second approach inverts the
> nesting of the loops and moves the use of sparsity of v to the outer loop.
> This allows us to access only a small number of columns of M which is a huge
> win even if A is entirely in memory.  Secondarily, it also effectively lifts
> the highly repeated determination of the sparsity pattern of v out of the
> inner loop resulting in additional savings.
>

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Dec 5, 2009 at 1:42 AM, Sean Owen <sr...@gmail.com> wrote:

> On Fri, Dec 4, 2009 at 7:35 PM, Ted Dunning <te...@gmail.com> wrote:
> > The preferable approach is for the first MR step to group by user as
> before,
> > then in the reduce down-sample the user items if desired and output that
> > list in a single record.  Down-sampling can be done on-line keeping just
> the
> > retained elements in memory.  Second MR would produce the cross product
> in
> > the mapper and use a combiner and reducer.
>
> That's what I'm doing -- outputting a Vector per user in the first MR.
> (I'm leaving out the extras like downsampling until the basic approach
> works.)
>

If any users have looked at more than a thousand items, downsampling becomes
very, very important.  The problem is that the number of non-zero summands
for a user is proportional to the square of the number of items for that
user.  The most active users quickly dominate the total cost.

If your user population follows Zipf's law exactly, this causes the A'A
computation to asymptotically cost quadratic time in the number of users.
In practice, the situation is slightly better because the most active 1% of
all users are not as active as Zipf would predict, but the computation cost
is still substantially super-linear and thus is not scalable even at
relatively moderate numbers of users.

Post your code and I will add the down-sampling.
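
For reference, the on-line down-sampling can be ordinary reservoir sampling;
a minimal sketch (not the code from any eventual patch):

    import java.util.Random;

    // Keep a uniform sample of at most maxItems of a user's items,
    // in one pass with O(maxItems) memory.
    static long[] downSample(long[] userItems, int maxItems, Random random) {
      if (userItems.length <= maxItems) {
        return userItems;
      }
      long[] reservoir = new long[maxItems];
      System.arraycopy(userItems, 0, reservoir, 0, maxItems);
      for (int i = maxItems; i < userItems.length; i++) {
        int j = random.nextInt(i + 1);  // replace with probability maxItems/(i+1)
        if (j < maxItems) {
          reservoir[j] = userItems[i];
        }
      }
      return reservoir;
    }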


>
> I think I'm going a different way to produce the cooccurrence matrix -
> no cross product, just counting and outputting all cooccurrence, and
> outputting item1ID -> item2ID as key-value pairs. That makes it tidy
> to produce the rows of the cooccurrence matrix in the reducer.
>

This is essentially what the combiner would be doing.  Your approach may be
a bit faster since the data is moving less.  The merge sorting used by
Hadoop may improve locality, which could offset your advantage, depending on
how large the cross product is.

Also, I typically discard counts from the user vector because that improves
resistance to spammers.  If you do that, then the counts you are producing
will all be 1 (cooccurrence for a single user).  That means that the
combiner and reducer will still be required.

I may be misunderstanding what you are saying.  You may be saying that you
will be accumulating all the cooccurrence data for all users in memory.
That seems like a bad idea given that the combiner/reducer approach is
pretty darned fast.

> Another approach is to make each column of A'A be stored in a key-value
> > store.  At recommendation time, you retrieve columns and add them.  This
> is
> > essentially equivalent to the Lucene approach without lucene.  Because we
> > know a lot about the contents (they are integers), you can probably write
> > tighter code than Lucene can use.  This would be a great use for the
> fancy
> > concurrent map builder that is in Google collections, for instance.
>
> Sounds cool but don't I need the rows of A'A to multiply against h? h
> is a column vector.
>

You can view matrix vector multiplication in two different ways depending on
how you nest the loops.  One way (what you are talking about) is to say that
the result is a vector of dot products.  This corresponds to this pseudo
code:

      for (i in 1:n) {
          r[i] = M.row(i).dot(v)
      }

In this approach, the inner loop is the dot product and we require row-wise
access to M.

The other way to look at it is as the weighted sum of vectors.

      r = zero(n)
      for (j in 1:m) {
          r += v[j] * M.column[j]
      }

Here the inner loop is the addition to the result and we require column-wise
access to M.

This second approach can be rewritten, though,

      r = zero(n)
      for (j in v.nonZeros()) {
           r += v[j] * M.column[j]
      }

The first approach requires access to all rows of M, but does the dot
products using the sparse pattern of v.  The second approach inverts the
nesting of the loops and moves the use of sparsity of v to the outer loop.
This allows us to access only a small number of columns of M which is a huge
win even if A is entirely in memory.  Secondarily, it also effectively lifts
the highly repeated determination of the sparsity pattern of v out of the
inner loop resulting in additional savings.
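
In Java, that second form might look like the following sketch (sparse v held
as parallel index/value arrays, dense columns for simplicity):

    // r = M v as a weighted sum of columns of M, touching only the
    // columns selected by the non-zero entries of v
    static double[] timesSparse(double[][] columnsOfM, int[] vIndices,
                                double[] vValues, int n) {
      double[] r = new double[n];
      for (int k = 0; k < vIndices.length; k++) {  // outer loop over non-zeros of v
        double[] column = columnsOfM[vIndices[k]];
        double weight = vValues[k];
        for (int i = 0; i < n; i++) {
          r[i] += weight * column[i];
        }
      }
      return r;
    }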


> Also why did you later say recommendation must occur online? seems
> quite doable offline and my picture of the point of this whole Hadoop
> framework is doing things offline. They've already gone to the trouble
> of running a cluster and have given up doing it entirely online, so...
>

I say this because user histories change in real-time and it is common to
require real-time updates to recommendations.  Since the recommendation is
so very cheap to compute it seems reasonable to do it at real time rather
than pay the cost of storing all recommendations for all users until they
need them.  Moreover, since the recommendation is more dense than either the
history or the columns of (A'A) and because many common columns of A'A will
be cached in memory, the I/O cost isn't even all that different.  That means
that total cost of just retrieving a recommendation is comparable to the
cost of computing it on the fly.

-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Dec 4, 2009 at 7:35 PM, Ted Dunning <te...@gmail.com> wrote:
> The preferable approach is for the first MR step to group by user as before,
> then in the reduce down-sample the user items if desired and output that
> list in a single record.  Down-sampling can be done on-line keeping just the
> retained elements in memory.  Second MR would produce the cross product in
> the mapper and use a combiner and reducer.

That's what I'm doing -- outputting a Vector per user in the first MR.
(I'm leaving out the extras like downsampling until the basic approach works.)

I think I'm going a different way to produce the cooccurrence matrix -
no cross product, just counting and outputting all cooccurrence, and
outputting item1ID -> item2ID as key-value pairs. That makes it tidy
to produce the rows of the cooccurrence matrix in the reducer.


> Correct.  (A'A) h can be computed in several ways, but it all comes down to
> the fact that h is very sparse.  Typically you make it even sparser by
> keeping only recent history.  If you have only 50 non-zeros in h, then you
> only need 50 columns of (A'A).  These can be retrieved many different ways,
> but one cool way is to make each row of A'A be a Lucene document.  The terms
> in the documents are items and the columns of A'A are the posting vectors in
> Lucene.  The weighting that Lucene does generally helps but can easily be
> defeated if desired.

I'll hold off leveraging Lucene for later. I'll also probably start by
just loading the whole row but yeah that's not quite efficient. The
other optimizations you mention later also make sense.


> Another approach is to make each column of A'A be stored in a key-value
> store.  At recommendation time, you retrieve columns and add them.  This is
> essentially equivalent to the Lucene approach without lucene.  Because we
> know a lot about the contents (they are integers), you can probably write
> tighter code than Lucene can use.  This would be a great use for the fancy
> concurrent map builder that is in Google collections, for instance.

Sounds cool but don't I need the rows of A'A to multiply against h? h
is a column vector.


Also why did you later say recommendation must occur online? seems
quite doable offline and my picture of the point of this whole Hadoop
framework is doing things offline. They've already gone to the trouble
of running a cluster and have given up doing it entirely online, so...

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Dec 4, 2009 at 12:33 AM, Sean Owen <sr...@gmail.com> wrote:

> Yes, this makes sense. I do need two passes. One pass converts input
> from "user,item,rating" triples into user vectors. Then the second
> step builds the co-occurrence A'A product. I agree that it will be
> faster to take a shortcut than properly compute A'A.
>

You can work directly with the triples as input if you have relatively small
input.  Mapper puts user as key, and (item) as value (this is cooccurrence
only so rating should only be used to filter which items get output).
Reducer takes list of items and produces cross product on output.  It
doesn't need to keep the full cross product.  Second MR positions (item1,
item2) as key and basically does a word count.

This approach has the problem of a large, uncombinable intermediate file.

The preferable approach is for the first MR step to group by user as before,
then in the reduce down-sample the user items if desired and output that
list in a single record.  Down-sampling can be done on-line keeping just the
retained elements in memory.  Second MR would produce the cross product in
the mapper and use a combiner and reducer.
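
The on-line down-sampling can be a standard reservoir sample.  A sketch,
assuming the user's item IDs arrive as a stream (plain Java, not tied to any
Mahout class):

import java.util.Arrays;
import java.util.Random;

class ReservoirSampler {
  // Keeps a uniform random sample of at most `limit` item IDs in memory.
  static long[] sample(Iterable<Long> itemIDs, int limit, Random random) {
    long[] reservoir = new long[limit];
    int seen = 0;
    for (long itemID : itemIDs) {
      if (seen < limit) {
        reservoir[seen] = itemID;              // fill the reservoir first
      } else {
        int slot = random.nextInt(seen + 1);   // keep with probability limit/(seen+1)
        if (slot < limit) {
          reservoir[slot] = itemID;
        }
      }
      seen++;
    }
    return seen < limit ? Arrays.copyOf(reservoir, seen) : reservoir;
  }
}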

If we wanted to follow up on Jake's idea, the transpose can be done in the
first MR step but it is hard to down-sample the user's items that way.

The first step of converting triples to vectors should also produce user and
item counts so that the second step can do the sparsification cleanly.


> (Though I'm curious how this works -- looks deceptively easy, this
> outer product approach. Isn't v cross v potentially huge? or likely to
> be sparse enough to not matter)
>

The cost is dominated by the square of the number of items that the most
prolific user has.  If you down sample that to something reasonable, then
you bound the total cost very tightly.  It is very unusual for the massively
prolific users to add any information ... generally they are spammers or
bots anyway.


> I understand the final step in principle, which is to compute (A'A)h.
> But I keep guessing A'A is too big to fit in memory?


Correct.  (A'A) h can be computed in several ways, but it all comes down to
the fact that h is very sparse.  Typically you make it even sparser by
keeping only recent history.  If you have only 50 non-zeros in h, then you
only need 50 columns of (A'A).  These can be retrieved many different ways,
but one cool way is to make each row of A'A be a Lucene document.  The terms
in the documents are items and the columns of A'A are the posting vectors in
Lucene.  The weighting that Lucene does generally helps but can easily be
defeated if desired.

Another approach is to make each column of A'A be stored in a key-value
store.  At recommendation time, you retrieve columns and add them.  This is
essentially equivalent to the Lucene approach without lucene.  Because we
know a lot about the contents (they are integers), you can probably write
tighter code than Lucene can use.  This would be a great use for the fancy
concurrent map builder that is in Google collections, for instance.


-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
Yes, this makes sense. I do need two passes. One pass converts input
from "user,item,rating" triples into user vectors. Then the second
step builds the co-occurrence A'A product. I agree that it will be
faster to take a shortcut than properly compute A'A.

(Though I'm curious how this works -- looks deceptively easy, this
outer product approach. Isn't v cross v potentially huge? or likely to
be sparse enough to not matter)

I understand the final step in principle, which is to compute (A'A)h.
But I keep guessing A'A is too big to fit in memory? So I can
side-load the rows of A'A one at a time and compute it rather
manually.


On Thu, Dec 3, 2009 at 8:28 PM, Ted Dunning <te...@gmail.com> wrote:
> I think you can merge my passes into a single pass in which you compute the
> row and column sums at the same time that you compute the product.  That is
> more complicated, though, and I hate fancy code.  So you are right in
> practice that I have always had two passes.  (although pig might be clever
> enough by now to merge them)
>
> There is another pass in which you use all of the sums to do the
> sparsification.  I don't know if that could be done in the same pass or not.

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
I think you can merge my passes into a single pass in which you compute the
row and column sums at the same time that you compute the product.  That is
more complicated, though, and I hate fancy code.  So you are right in
practice that I have always had two passes.  (although pig might be clever
enough by now to merge them)

There is another pass in which you use all of the sums to do the
sparsification.  I don't know if that could be done in the same pass or not.

On Thu, Dec 3, 2009 at 10:55 AM, Jake Mannix <ja...@gmail.com> wrote:

> But yeah, in the case at hand, maybe that transposition is a bunch of work.
> But is it more work to do my two MR jobs than to do your two MR jobs?  Both
> look like two full passes over the data, right?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
I guess yes, it depends on what form your data is in.  If you never have a
need to produce the transpose, then this is costly, sure.  If you are
already building your matrix out of something like, say, a Lucene index,
then you already *have* the transpose.  Or if you have another use for
doing a MR job to produce the transpose and then keeping it around for
other purposes, I guess.

But yeah, in the case at hand, maybe that transposition is a bunch of work.
But is it more work to do my two MR jobs than to do your two MR jobs?  Both
look like two full passes over the data, right?

  -jake

On Thu, Dec 3, 2009 at 10:46 AM, Ted Dunning <te...@gmail.com> wrote:

> I have always tried to avoid transposing the data matrix.
>
> During my design phase for this, I was working on patch-wise patterns.
> About then, Chris Dyer tried the simpler approach for a machine translation
> problem and got very good results.
>
> The major problem with the transpose is that it requires a MR job of its
> own
> that is nearly as expensive as the multiply.   The combiner really makes a
> huge difference to the inner product approach.
>
> On Thu, Dec 3, 2009 at 10:31 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Wait, this is just doing A'A?  Am I misunderstanding, or is this not most
> > easily done by first transposing A into A', and then doing the outer
> > products instead of the inner products
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
I have always tried to avoid transposing the data matrix.

During my design phase for this, I was working on patch-wise patterns.
About then, Chris Dyer tried the simpler approach for a machine translation
problem and got very good results.

The major problem with the transpose is that it requires a MR job of its own
that is nearly as expensive as the multiply.   The combiner really makes a
huge difference to the inner product approach.

On Thu, Dec 3, 2009 at 10:31 AM, Jake Mannix <ja...@gmail.com> wrote:

> Wait, this is just doing A'A?  Am I misunderstanding, or is this not most
> easily done by first transposing A into A', and then doing the outer
> products instead of the inner products
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Jake Mannix <ja...@gmail.com>.
Wait, this is just doing A'A?  Am I misunderstanding, or is this not most
easily done by first transposing A into A', and then doing the outer
products instead of the inner products:

class Map extends Mapper<IntWritable, Vector, NullWritable, Matrix> {
  map(IntWritable i, Vector v, Context c) {
    c.write(NullWritable.get(), v.cross(v));  // emit the outer product v v'
  }
}

class Reduce extends Reducer<NullWritable, Matrix, NullWritable, Matrix> {
  reduce(NullWritable n, Iterable<Matrix> it, Context c) {
    Matrix out = null;
    for (Matrix m : it) out = (out == null) ? m : out.plus(m);
    c.write(n, out);  // write the accumulated sum, not the last summand
  }
}

Then the outer products are computed in the mapper, and the sum is done in
the combiners and single reducer.

Not sure how you'd fold in sparsification to that though.

  -jake

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
I think that the question here is how to generate the A' A product.

If so, then answer is to read one row of A at a time and emit all of the
product pairs that could happen from that row.  The key is the row and
column in the result matrix.

More precisely:

    map: i, A[i,*]  =>  foreach j foreach k yield ((j,k), A[i,j] * A[i,k])
    combine and reduce: (j,k), values => (j, k), sum(values)

The output is A'A in triple form.  You might be able to do a trick with
sorting and just use j as the key.  Then you could emit an entire row at a
time.
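
In Java, the map step for one sparse row of A might look like this sketch
(the Emitter interface stands in for the MapReduce output context; the
names are illustrative):

class RowProductsMapper {
  interface Emitter { void emit(long j, long k, double value); }

  // For row i of A, held sparsely as parallel arrays, emit every pairwise
  // product keyed by its (j,k) position in A'A.
  static void map(long[] itemIDs, double[] values, Emitter out) {
    for (int a = 0; a < itemIDs.length; a++) {
      for (int b = 0; b < itemIDs.length; b++) {
        out.emit(itemIDs[a], itemIDs[b], values[a] * values[b]);
      }
    }
  }
}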

A second MR needs to produce row sums of A'A.

This product may require down-sampling of non-sparse rows of A because it is
quadratic in the densest row count.  I usually just use random sampling for
any row with more than several hundred entries.  This is defensible because
little additional information is gained after the first hundred or so
entries.

On Thu, Dec 3, 2009 at 7:02 AM, Sean Owen <sr...@gmail.com> wrote:

> One question that maybe Ted already knows the answer to: how would I
> iterate over A_u* for each A_u*? the mapper would touch only one each.
> I guess I could use a whole MR step to generate the cartesian product
> so to speak.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
I'm going ahead with implementing something like this as a simple but
real distributed implementation.

One question that maybe Ted already knows the answer to: how would I
iterate over A_u* for each A_u*? the mapper would touch only one each.
I guess I could use a whole MR step to generate the cartesian product
so to speak.

On Thu, Sep 10, 2009 at 1:44 AM, Ted Dunning <te...@gmail.com> wrote:
> Another implementation is to iterate over users emitting cooccurrence
> pairs.  These can be sorted on the items involved and the counts
> aggregated.  This is ideal for map reduce implementation.
>
> Roughly the matrix approach becomes this:
>
>  Map:
>    input record is A_u*, all items for a particular user u
>    for each item i in A_u*
>       yield ((i,*), 1)
>       for each item j in A_u*
>          yield ((i, j), 1)
>
>   Reduce:
>      input record is (i,j), (k1, k2, ... kn)
>      yield (i,j), sum k*
>
>          or
>
>       input record is (i,*), (k1, k2, ... ,kn)
>       yield (i,*) sum k*
>
> This computes the elements of A'A (item cooccurrence counts) and the row
> sums of A'A in one step.   You might want to compute the column sums of A
> first to use these for weighting A before using this program.
>
> Then you need another MR step to bring together the overall total, the row
> counts and the cell counts for final reduction.
>
> This is inside out compared with the most obvious implementation and there
> is no place where we really have two entire item vectors.  This will make it
> rather harder to have a totally simple pluggable weighting scheme that runs
> as fast.

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
Another implementation is to iterate over users emitting cooccurrence
pairs.  These can be sorted on the items involved and the counts
aggregated.  This is ideal for map reduce implementation.

Roughly the matrix approach becomes this:

 Map:
    input record is A_u*, all items for a particular user u
    for each item i in A_u*
       yield ((i,*), 1)
       for each item j in A_u*
          yield ((i, j), 1)

   Reduce:
      input record is (i,j), (k1, k2, ... kn)
      yield (i,j), sum k*

          or

       input record is (i,*), (k1, k2, ... ,kn)
       yield (i,*) sum k*

This computes the elements of A'A (item cooccurrence counts) and the row
sums of A'A in one step.   You might want to compute the column sums of A
first to use these for weighting A before using this program.
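
A Java rendering of that map step (the Emitter stands in for the MR context;
the "*" marker is encoded as item ID -1 here, purely as an illustration):

class CooccurrenceMapper {
  interface Emitter { void emit(long i, long j, int count); }
  static final long STAR = -1L;  // stands for the (i,*) row-sum key

  static void map(long[] userItems, Emitter out) {
    for (long i : userItems) {
      out.emit(i, STAR, 1);            // contributes to the row sum of A'A
      for (long j : userItems) {
        out.emit(i, j, 1);             // contributes to cooccurrence count (i,j)
      }
    }
  }
}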

Then you need another MR step to bring together the overall total, the row
counts and the cell counts for final reduction.

This is inside out compared with the most obvious implementation and there
is no place where we really have two entire item vectors.  This will make it
rather harder to have a totally simple pluggable weighting scheme that runs
as fast.



On Wed, Sep 9, 2009 at 3:42 PM, Gökhan Çapan <gk...@gmail.com> wrote:

> 3. After the change:
>   a. Matrix approach becomes:
>
>       userSet=Intersect ( nonZeroElements(A' i), nonZeroElements(A' j) );
>       weight:=0;
>       for each user u in userSet
>           weight+= rating(u,X)*rating(u,Y);
>
>
>   b: Taste's implementation becomes:
>
>       userSet=Intersect ( nonZeroElements(A' i), nonZeroElements(A' j) );
>       weight:=0;
>       for each user u in userSet
>           updateWeightWithChosenSimilarityMetric( );
>
>
> I think the 3rd version is already benefiting from the sparsity of the
> data; obviously, as the data set gets sparser and/or the # of users
> increases, that trick will speed up the algorithm more. So, doesn't that
> mean the 3rd is an algorithm that uses the sparse data structures you
> mentioned?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
Very close.  You are conceptually exactly correct.

If A contains binary visit or view data, then A'A contains counts that must
be reduced to binary values or weights using some statistical procedure.  I
prefer LLR and binary results.

If A contains counts weighted by inverse user frequency, then your dot
product is roughly usable as a similarity score.  This is especially true if
rows of A are normalized somehow to account for over-active users.
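
As one possible rendering of that weighting (the array layout and the
log-based IUF formula are assumptions, not a prescription):

class IufWeighting {
  // a[u][i] holds counts; returns an inverse-user-frequency weighted,
  // row-normalized copy so over-active users don't dominate.
  static double[][] weight(double[][] a, int numUsers, int numItems) {
    double[] iuf = new double[numItems];
    for (int i = 0; i < numItems; i++) {
      int users = 0;
      for (int u = 0; u < numUsers; u++) {
        if (a[u][i] > 0) users++;
      }
      iuf[i] = users == 0 ? 0.0 : Math.log((double) numUsers / users);
    }
    double[][] w = new double[numUsers][numItems];
    for (int u = 0; u < numUsers; u++) {
      double norm = 0.0;
      for (int i = 0; i < numItems; i++) {
        w[u][i] = a[u][i] * iuf[i];
        norm += w[u][i] * w[u][i];
      }
      norm = Math.sqrt(norm);
      if (norm > 0) {
        for (int i = 0; i < numItems; i++) {
          w[u][i] /= norm;               // unit-length rows tame prolific users
        }
      }
    }
    return w;
  }
}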

On Wed, Sep 9, 2009 at 3:42 PM, Gökhan Çapan <gk...@gmail.com> wrote:

> > A is the user x item history matrix.  Each row is a user history.
> >
> > A' is the transposed user x item matrix which is of the shape item x
> > user.
> >
> > A' A is the user-level item cooccurrence matrix and has the shape item x
> > item.
> >
>
> Then (A' A)ij is a similarity weight between the ith and jth items.
> If Aij is the "rating of ith user for jth item", the highest value in the
> ith row of A'A identifies the most similar item to the ith item.
> If the values in A are binary, then (A' A)ij is the number of users who
> have rated/clicked/viewed both item i and item j.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Gökhan Çapan <gk...@gmail.com>.
> A is the user x item history matrix.  Each row is a user history.
>
> A' is the transposed user x item matrix which is of the shape item x user.
>
> A' A is the user-level item cooccurrence matrix and has the shape item x
> item.
>

Then (A' A)ij is a similarity weight between the ith and jth items.
If Aij is the "rating of ith user for jth item", the highest value in the
ith row of A'A identifies the most similar item to the ith item.
If the values in A are binary, then (A' A)ij is the number of users who
have rated/clicked/viewed both item i and item j.

For this multiplication technique, (A' A)ij is computed as (A'i) x (A'j)'
(where A'i and A'j are row vectors of A').

So we have 3 algorithms to compute the similarity between two items:
X : row vector A'i
Y : row vector A'j



1. basic matrix approach:

    weight:=0;
    for each user u
        if u rated both X and Y: weight += rating(u,X)*rating(u,Y);


2. Taste's implementation was:

    weight:=0;
    for each user u
        if rating(u,X) != 0 and rating(u,Y) != 0:
            updateWeightWithChosenSimilarityMetric( );


3. After the change:
   a. Matrix approach becomes:

       userSet=Intersect ( nonZeroElements(A' i), nonZeroElements(A' j) );
       weight:=0;
       for each user u in userSet
           weight+= rating(u,X)*rating(u,Y);


   b: Taste's implementation becomes:

       userSet=Intersect ( nonZeroElements(A' i), nonZeroElements(A' j) );
       weight:=0;
       for each user u in userSet
           updateWeightWithChosenSimilarityMetric( );


I think the 3rd version is already benefiting from the sparsity of the
data; obviously, as the data set gets sparser and/or the # of users
increases, that trick will speed up the algorithm more. So, doesn't that
mean the 3rd is an algorithm that uses the sparse data structures you
mentioned?
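
In plain Java, version 3a amounts to something like this sketch (the rating
maps stand in for the item columns of A, i.e. the rows of A'):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IntersectionSimilarity {
  static double weight(Map<Long, Double> ratingsForX, Map<Long, Double> ratingsForY) {
    // userSet = Intersect( nonZeroElements(X), nonZeroElements(Y) )
    Set<Long> userSet = new HashSet<>(ratingsForX.keySet());
    userSet.retainAll(ratingsForY.keySet());
    double weight = 0.0;
    for (long user : userSet) {
      weight += ratingsForX.get(user) * ratingsForY.get(user);
    }
    return weight;
  }
}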


"Love spurned effect":
I totally agree: people don't give negative feedback to items that are very
dissimilar to items they like; they don't even look at such items.
I think the reason is that they give negative feedback to "false positive"
recommendations.


-- 
Gökhan Çapan

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
A is the user x item history matrix.  Each row is a user history.

A' is the transposed user x item matrix which is of the shape item x user.

A' A is the user-level item cooccurrence matrix and has the shape item x
item.
There is one row for each item and each row contains items recommended for
the item corresponding to the row.

(A' A) h is a sum of item level recommendations.

BUT remember that it is the shape of these computations that matters rather
than the explicit operations.  Thus,

     A could be session x item instead of user x item

     A' A could be weighted by IDF considerations

     A' A could be binarized to have only items with large LLR score.  In
fact, if A is binary then A'A as computed using literal matrix multiply is
exactly what the LLR score needs.  Thus we could replace A'A with
( LLR(A'A) > 10 ) or some such.  The pattern of computation is still the
same.
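
For reference, the LLR score on the 2x2 cooccurrence table can be computed
with the usual entropy formulation; this sketch follows that standard
G-test form (k11 = users with both items, k12 and k21 = one item only,
k22 = neither):

class Llr {
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  static double entropy(long... counts) {
    long sum = 0;
    double logSum = 0.0;
    for (long c : counts) {
      logSum += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - logSum;
  }

  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }
}

Binarizing A'A then amounts to keeping only cells where
logLikelihoodRatio(k11, k12, k21, k22) > 10.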

Historically, there are several ways I have dealt with missing observations.

- for binary A, missing elements are most naturally 0.  This comes about
because we are counting the number of positive observations and 0 is the
natural identity for counting.

- for situations with useful ratings, it is better to use one or more
binarized versions of A.

    In typical cases where positive ratings dominate the process, A can
contain 1 where there is a rating and 0 elsewhere, and we have the normal
binary case.

    In other cases, it is sometimes paradoxically useful to consider ALL
ratings as positive and then filter out explicit negative ratings from the
recommendation results.  The reason that this works is that people often
rate things negatively that are very close to things that they like.
Things very dissimilar from what they like, they never even look at, much
less rate.  Thus, negative ratings may actually be more useful as
indicators of what a person likes than as indicators of what they don't.
I call this the "love spurned" effect.

    Finally, you can keep separate A matrices that are each binary and
which correspond to negative and positive ratings respectively.  Then you
proceed with both computations and get positive and negative
recommendations.  Combining these recommendations has always been
difficult for me, largely because the negative recommendations are
confounded by the love spurned effect.


On Wed, Sep 9, 2009 at 1:39 AM, Sean Owen <sr...@gmail.com> wrote:

> So maybe let's visit the matrix approach a bit here -- what is A'A? is
> this the similarity matrix? working backwards, seems like that's the
> thing to left-multiply with the user rating vector to get
> recommendations.
>
> The first question I have is how does this cope with missing elements?
> I understand the idea is to use a sparse representation, but, the
> nature of the computation means these elements will be treated as
> zero, which doesn't work.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Sep 9, 2009 at 1:39 AM, Sean Owen <sr...@gmail.com> wrote:

> > My focus in the past has always been to produce recs really fast
> > (sometimes hundreds per second).  As is generally the case, doing as
> > much computation ahead of time is a great way to get fast response.
>
> What about incorporating new information at runtime? For example,
> thinking of the case of the first-time user who rates 3 things and
> then... waits until the next run of the offline process? That's my
> concern, along these lines.


If you look back at the original formulation, r = (A' A) h where A is the
user-item matrix and h is the (current) user's history, only the A'A part
is computed off-line.  If user history h is updated as you say, then the
recommendations r are also updated without needing the off-line computation.
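
A sketch of that on-line step, with the columns of A'A held in some cache or
key-value store (all names here are illustrative):

import java.util.HashMap;
import java.util.Map;

class OnlineScorer {
  // r = (A'A) h, as a weighted sum of cached cooccurrence columns for the
  // non-zeros of the freshly updated history h.
  static Map<Long, Double> score(Map<Long, Double> history,
                                 Map<Long, Map<Long, Double>> cooccurrenceColumns) {
    Map<Long, Double> r = new HashMap<>();
    for (Map.Entry<Long, Double> h : history.entrySet()) {
      Map<Long, Double> column = cooccurrenceColumns.get(h.getKey());
      if (column == null) continue;        // item unseen off-line contributes nothing
      for (Map.Entry<Long, Double> cell : column.entrySet()) {
        r.merge(cell.getKey(), h.getValue() * cell.getValue(), Double::sum);
      }
    }
    return r;  // rank by value, filtering out items already in h
  }
}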

In practice, as you say there are some more tweaks to be had, especially in
the UI.

I find, for instance, that it is important to rotate the recommendations
relatively frequently.
I usually accomplish that by (partially) randomized jittering.
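
One way to read "partially randomized jittering" is to perturb each score by
a small random factor before sorting, so the ranking rotates without straying
far from the model's ordering (a sketch under that assumption):

import java.util.Comparator;
import java.util.List;
import java.util.Random;

class Jitter {
  // Each element is {itemID, score}; epsilon around 0.1 gives a mild shuffle.
  static void jitter(List<double[]> scoredItems, double epsilon, Random random) {
    for (double[] scored : scoredItems) {
      scored[1] *= 1.0 + epsilon * (2.0 * random.nextDouble() - 1.0);
    }
    scoredItems.sort(Comparator.comparingDouble((double[] s) -> s[1]).reversed());
  }
}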

-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
On Tue, Sep 8, 2009 at 11:16 PM, Ted Dunning<te...@gmail.com> wrote:
> This can be described using the method that I use by just moving
> parentheses.  The big off-line version of recommendation is (A' A) h while
> the more on-line version is A' (A h).  The overall shape of the
> computation is the same but the first form allows the majority of the
> computation to be done off-line.
>
> My focus in the past has always been to produce recs really fast (sometimes
> hundreds per second).  As is generally the case, doing as much computation
> ahead of time is a great way to get fast response.

What about incorporating new information at runtime? For example,
thinking of the case of the first-time user who rates 3 things and
then... waits until the next run of the offline process? That's my
concern, along these lines.

I completely agree that, if you can get away with offline computation,
things can be much more efficient.

I see a spectrum of needs developing -- from completely online to
completely offline. I think we need more of the completely-offline
stuff to support cases where there are no such requirements, and scale
dictates it just has to be offline.

> I tend to view requirements for lots of hooks and modifications as an
> indication that the underlying algorithms are not phrased well.  That isn't
> always true, but it often is.  Use of methods that depend on normal
> distribution assumptions often have pathological performance for small
> counts.  This leads to lots of things like "if (n > 5)" sorts of hacks to
> avoid problematic cases.  These methods include most chi-squared techniques,
> correlation based recommendation engines and LSA.  The right answer is to
> avoid the original mistake and use log-likelihood ratio tests, probabilistic
> cooccurrence measures and LDA respectively.

+1, have come to very much agree. Not all 'hooks' are bad, though I am
sure a matrix-based approach can accommodate a lot of things like
this.


So maybe let's visit the matrix approach a bit here -- what is A'A? is
this the similarity matrix? working backwards, seems like that's the
thing to left-multiply with the user rating vector to get
recommendations.

The first question I have is how does this cope with missing elements?
I understand the idea is to use a sparse representation, but, the
nature of the computation means these elements will be treated as
zero, which doesn't work.

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
Taking your comments out of order.

On Tue, Sep 8, 2009 at 2:41 PM, Sean Owen <sr...@gmail.com> wrote:

>  And yeah my world view is not centered around big off-line
> computations. In that world, yes I bet this is a good way to compute
> all recommendations at once. I still maintain that interesting use
> cases involve roughly real-time updates and recommendations.
>

This can be described using the method that I use by just moving
parentheses.  The big off-line version of recommendation is (A' A) h while
the more on-line version is A' (A h).  The overall shape of the
computation is the same but the first form allows the majority of the
computation to be done off-line.

My focus in the past has always been to produce recs really fast (sometimes
hundreds per second).  As is generally the case, doing as much computation
ahead of time is a great way to get fast response.


> I think reducing it to a matrix multiplication problem is tidy, and
> works I am sure.
>

This approach has been good for me for the last decade of building rec
engines.

> I don't know if, practically, it affords the tweaks
> and hooks and modifications that let some of these stock algos
> actually produce good results. It can probably be made to work too.
>

As I tried to say in my original comment, I am only describing the shape or
pattern for the computation.  There are lots of tweaks that can be added
without changing this shape.  Also, there are lots of ways of implementing
the general pattern.  One of my favorites is to use a text retrieval engine
for the matrix-vector multiplication part and a map-reduce engine for the
matrix-matrix part.

I tend to view requirements for lots of hooks and modifications as an
indication that the underlying algorithms are not phrased well.  That isn't
always true, but it often is.  Use of methods that depend on normal
distribution assumptions often have pathological performance for small
counts.  This leads to lots of things like "if (n > 5)" sorts of hacks to
avoid problematic cases.  These methods include most chi-squared techniques,
correlation based recommendation engines and LSA.  The right answer is to
avoid the original mistake and use log-likelihood ratio tests, probabilistic
cooccurrence measures and LDA respectively.

As another example, in text retrieval, stop words are often used because
results for many retrieval engines are sooo bad without them.  Really,
though, this indicates a failure of the scoring algorithms, the phrase
recognition algorithms and possibly the index format.  All of these can be
fixed.  Whether it is worth fixing is another question, but the inference
from the need for stop lists that the fundamentals are somewhat broken is
still valid.

> My gut is that it is not the most efficient way to do it -- if only
> because it definitely is not efficient to compute this to produce
> *one* set of recs, or, to answer many of the questions one asks of a
> recommender, which is not just to produce recs.
>

Efficiency is a tricky thing here.  Because the off-line and on-line forms
produce the same computation, and because the majority of the work is in the
off-line part, it is a definite win for response time to do off-line work.

But there is much more to it than that.  The off-line form of the
computation can be done using very clean, very fast sequential accesses to
the data.  The on-line form requires lots of random accesses to data.  That
difference makes the computation whalingly faster to do off-line.  In my
experience "whalingly" can be three orders of magnitude which is a huge
factor.

Even more important, however, is that there are lots of streamlinings that
can be done if you have global information and these can save many more
orders of magnitude in speed.  For instance, suppose that you have some item
that 50 million people have interacted with.  Do you think that you will
have any noticeable accuracy loss if you consider only 1 million of those
interactions?  Or even if you only consider a few thousand of them?  The
correct answer is that you can randomly downsample to about a thousand
interactions with no perceptible difference in accuracy.  This has a huge
impact because the cost of computing the coocurrence counts between two
items goes up with the product of the prevalence of each item.  If you
decrease each by a factor of a thousand, you have decreased the necessary
computation by a million.  Unfortunately, the overall prevalence is a global
property that is much easier to deal with off-line.

Similarly, sparsification by statistical means is trivial off-line and
difficult on-line.  This can, again, save you orders of magnitude in cost at
final recommendation time.

As a concrete example of just how massive these speedups are, I worked on
one system that had about 100-200 million interactions per day with about 10
million items.  We used 7 months of history to generate recommendations.
This included about 7 x 30 x 100-200 M = 20-40 billion observations.  Using
the techniques I have described we were able to build a system that did well
over 100 recommendations per second.  The off-line computation could be done
relatively easily in about 10 hours and would take less with more recent
versions of Pig and Hadoop.  Total hardware included a tiny hadoop cluster
with 10-20 cores and about 40GBytes of RAM for the off-line portion and a
few real-time servers.  With more modern machines, this entire operation
could have been done with 2-4 1U servers.  The recommendation quality was
excellent.  The system captured changes in recommendations much faster than
they actually occurred.

That may not have been the *most* efficient way to solve the problem, but it
was efficient enough to be hard to beat.

-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
On Tue, Sep 8, 2009 at 10:11 PM, Ted Dunning<te...@gmail.com> wrote:
> In this case, the interaction matrix A is turned into a recommendation
> matrix using matrix multiplication: A' A.  To implement recommendation given
> a user history u, we recommend the items that have the largest values in (A'
> A) u.

One initial comment I'd make is that it's not generally true that you
want to compute all recs at once, which is the direction you're going?

I think reducing it to a matrix multiplication problem is tidy, and
works I am sure. I don't know if, practically, it affords the tweaks
and hooks and modifications that let some of these stock algos
actually produce good results. It can probably be made to work too.

My gut is that it is not the most efficient way to do it -- if only
because it definitely is not efficient to compute this to produce
*one* set of recs, or, to answer many of the questions one asks of a
recommender, which is not just to produce recs.

And yeah my world view is not centered around big off-line
computations. In that world, yes I bet this is a good way to compute
all recommendations at once. I still maintain that interesting use
cases involve roughly real-time updates and recommendations.

I must disclaim that I simply don't understand this approach enough,
and haven't tried it at all, so have not much basis for reasoning
about it.

But rest assured that even if, in your picture of how this might be
implemented, this issue implies that someone didn't really think through
the data structures, I don't think it carries the same implication for
how it's implemented now. I do think the current
implementation is roughly right for the questions that need to be
answered. This is really a case of 'cheating' by relaxing an
assumption to gain performance -- but one that is obviously worth it
-- rather than fixing a fundamental flaw of some sort.

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
The data structure does bear on this, actually.

You and I see these algorithms a bit differently, I know, which may explain
some of our difference in whether this comes from the data structures.

I see essentially all recommendation algorithms as being patterned on matrix
multiplication of the user x item interaction matrix.  The simplest case is
pure item-based recommendation with no fancy stats for sparsification or
frequency weighting.

In this case, the interaction matrix A is turned into a recommendation
matrix using matrix multiplication: A' A.  To implement recommendation given
a user history u, we recommend the items that have the largest values in (A'
A) u.

If A is sparse, then so should be A' A, although to a lesser degree.
Typically, u will also be sparse.  This means that the computation of the
recommendation using (A' A) u should use entirely sparse algorithms.

So which items will be considered for recommendation at recommendation-time?

The answer is that A'A will have non-zero elements wherever a row (user) of
A has two non-zero items.  Wherever one of these items is an item in u, then
the other item will be considered for recommendation.

Does this make sense?

In practice, of course, (A'A) won't be used directly, but instead will be
further sparsified or replaced by some decomposition into an alternate
coordinate representation.

On Tue, Sep 8, 2009 at 1:26 PM, Sean Owen <sr...@gmail.com> wrote:

> Not sure it has a lot to do with data structures? It's not, well,
> obvious that one doesn't need to consider every item for
> recommendation -- that's the canonical way the algorithms work. In
> practice, there's a shortcut.
>
> Technically, it's kind of wrong. There's nothing about recommender
> engines per se that dictates you couldn't recommend items that aren't
> connected by some co-occurrence. But in practice, it's pretty fine to
> take the shortcut.
>
> This is probably something my brain should have gotten to earlier,
> but I'm not really sure it is suggestive of another problem, unless I
> miss something.
>
> On Tue, Sep 8, 2009 at 9:15 PM, Ted Dunning<te...@gmail.com> wrote:
> > These issues should have been handled pretty transparently by the sparse
> > data structures.
> >
> > That they were not is a red flag that there should be a bit of thought
> > applied here.
> >
> > On Tue, Sep 8, 2009 at 5:44 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> I wanted the user base to take note of the change Gokhan suggests
> >> below. I committed a variant of it just now which does indeed notably
> >> speed up most algorithms by more intelligently selecting possibilities
> >> to consider. On one test I am playing with it sped things up by 50% --
> >> in another, more like 400%. Depending on your data this could be a big
> >> win.
> >>
> >> Sean
> >>
> >> On Mon, Sep 7, 2009 at 2:03 PM, Gökhan Çapan<gk...@gmail.com> wrote:
> >> > Hi, Sean.
> >> > I think we talked about mostSimilarItems( ) function before, about a
> >> > bug in ItemBasedRecommender.
> >> > I think there is another issue, about performance.
> >> >
> >> > mostSimilarItems function gives the list of most similar items to a
> >> > given item.
> >> > In computation of those items, the algorithm looks at all other items
> >> > in data model, but if there is no user that rated 2 items together it
> >> > is needless to look if there is a similarity between active item and
> >> > that item.
> >> >
> >> >
> >> >
> >> > That is the original function that returns most similar items list in
> >> > cf.taste.impl.recommender.GenericItemBasedRecommender:
> >> >
> >> >  private List<RecommendedItem> doMostSimilarItems(long itemID,
> >> >                                                    int howMany,
> >> >                                                    TopItems.Estimator<Long>
> >> > estimator) throws TasteException {
> >> >     DataModel model = getDataModel();
> >> >     FastIDSet allItemIDs = new FastIDSet(model.getNumItems());
> >> >     LongPrimitiveIterator it = model.getItemIDs();
> >> >
> >> >
> >> >     while (it.hasNext()) {
> >> >       allItemIDs.add(it.nextLong());
> >> >     }
> >> >     allItemIDs.remove(itemID);
> >> >     return TopItems.getTopItems(howMany, allItemIDs.iterator(), null,
> >> > estimator);
> >> >   }
> >> >
> >> >
> >> >
> >> >
> >> > I updated and use it that way:
> >> >  private List<RecommendedItem> doMostSimilarItems(long itemID,
> >> >                                                    int howMany,
> >> >                                                    TopItems.Estimator<Long>
> >> > estimator) throws TasteException {
> >> >     DataModel model = getDataModel();
> >> >
> >> >       FastIDSet set=new FastIDSet();
> >> >       PreferenceArray arr=model.getPreferencesForItem(itemID);
> >> >       for(int i=0;i<arr.length();i++){
> >> >           set.addAll(model.getItemIDsFromUser(arr.get(i).getUserID()));
> >> >       }
> >> >       set.remove(itemID);
> >> >       return TopItems.getTopItems(howMany,set.iterator(),null,estimator);
> >> >   }
> >> >
> >> >
> >> >
> >> > The only difference between two function is:
> >> > the original one passes all items to getTopItems
> >> > mine passes only the items that have at least one user who've rated
> >> > both active item and that item.
> >> >
> >> >
> >> >
> >> > This little change made the algorithm pretty faster
> >> > (For my data set it runs 4 times faster now.)
> >> >
> >> > I wanted to inform you, if you want to try and update the code.
> >> > If for another reason original version of the code is better, please
> >> > make me know.
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Taste-GenericItemBasedRecommender

Posted by Sean Owen <sr...@gmail.com>.
Not sure it has a lot to do with data structures? It's not, well,
obvious that one doesn't need to consider every item for
recommendation -- that's the canonical way the algorithms work. In
practice, there's a shortcut.

Technically, it's kind of wrong. There's nothing about recommender
engines per se that dictates you couldn't recommend items that aren't
connected by some co-occurrence. But in practice, it's pretty fine to
take the shortcut.

This is probably something my brain should have gotten to earlier,
but I'm not really sure it is suggestive of another problem, unless I
miss something.

On Tue, Sep 8, 2009 at 9:15 PM, Ted Dunning<te...@gmail.com> wrote:
> These issues should have been handled pretty transparently by the sparse
> data structures.
>
> That they were not is a red flag that there should be a bit of thought
> applied here.
>
> On Tue, Sep 8, 2009 at 5:44 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> I wanted the user base to take note of the change Gokhan suggests
>> below. I committed a variant of it just now which does indeed notably
>> speed up most algorithms by more intelligently selecting possibilities
>> to consider. On one test I am playing with it sped things up by 50% --
>> in another, more like 400%. Depending on your data this could be a big
>> win.
>>
>> Sean
>>
>> On Mon, Sep 7, 2009 at 2:03 PM, Gökhan Çapan<gk...@gmail.com> wrote:
>> > Hi, Sean.
>> > I think we talked about mostSimilarItems( ) function before, about a bug
>> > in ItemBasedRecommender.
>> > I think there is another issue, about performance.
>> >
>> > mostSimilarItems function gives the list of most similar items to a given
>> > item.
>> > In computation of those items, the algorithm looks at all other items in
>> > data model, but if there is no user that rated 2 items together it is
>> > needless to look if there is a similarity between active item and that
>> > item.
>> >
>> >
>> >
>> > That is the original function that returns most similar items list in
>> > cf.taste.impl.recommender.GenericItemBasedRecommender:
>> >
>> >  private List<RecommendedItem> doMostSimilarItems(long itemID,
>> >                                                    int howMany,
>> >                                                    TopItems.Estimator<Long>
>> > estimator) throws TasteException {
>> >     DataModel model = getDataModel();
>> >     FastIDSet allItemIDs = new FastIDSet(model.getNumItems());
>> >     LongPrimitiveIterator it = model.getItemIDs();
>> >
>> >
>> >     while (it.hasNext()) {
>> >       allItemIDs.add(it.nextLong());
>> >     }
>> >     allItemIDs.remove(itemID);
>> >     return TopItems.getTopItems(howMany, allItemIDs.iterator(), null,
>> > estimator);
>> >   }
>> >
>> >
>> >
>> >
>> > I updated and use it that way:
>> >  private List<RecommendedItem> doMostSimilarItems(long itemID,
>> >                                                    int howMany,
>> >                                                    TopItems.Estimator<Long>
>> > estimator) throws TasteException {
>> >     DataModel model = getDataModel();
>> >
>> >       FastIDSet set=new FastIDSet();
>> >       PreferenceArray arr=model.getPreferencesForItem(itemID);
>> >       for(int i=0;i<arr.length();i++){
>> >           set.addAll(model.getItemIDsFromUser(arr.get(i).getUserID()));
>> >       }
>> >       set.remove(itemID);
>> >       return TopItems.getTopItems(howMany,set.iterator(),null,estimator);
>> >   }
>> >
>> >
>> >
>> > The only difference between two function is:
>> > the original one passes all items to getTopItems
>> > mine passes only the items that have at least one user who've rated both
>> > active item and that item.
>> >
>> >
>> >
>> > This little change made the algorithm pretty faster
>> > (For my data set it runs 4 times faster now.)
>> >
>> > I wanted to inform you, if you want to try and update the code.
>> > If for another reason original version of the code is better, please
>> > make me know.
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Taste-GenericItemBasedRecommender

Posted by Ted Dunning <te...@gmail.com>.
These issues should have been handled pretty transparently by the sparse
data structures.

That they were not is a red flag that there should be a bit of thought
applied here.

On Tue, Sep 8, 2009 at 5:44 AM, Sean Owen <sr...@gmail.com> wrote:

> I wanted the user base to take note of the change Gokhan suggests
> below. I committed a variant of it just now which does indeed notably
> speed up most algorithms by more intelligently selecting possibilities
> to consider. On one test I am playing with it sped things up by 50% --
> in another, more like 400%. Depending on your data this could be a big
> win.
>
> Sean
>
> On Mon, Sep 7, 2009 at 2:03 PM, Gökhan Çapan<gk...@gmail.com> wrote:
> > Hi, Sean.
> > I think we talked about mostSimilarItems( ) function before, about a bug
> > in ItemBasedRecommender.
> > I think there is another issue, about performance.
> >
> > mostSimilarItems function gives the list of most similar items to a given
> > item.
> > In computation of those items, the algorithm looks at all other items in
> > data model, but if there is no user that rated 2 items together it is
> > needless to look if there is a similarity between active item and that
> > item.
> >
> >
> >
> > That is the original function that returns most similar items list in
> > cf.taste.impl.recommender.GenericItemBasedRecommender:
> >
> >  private List<RecommendedItem> doMostSimilarItems(long itemID,
> >                                                    int howMany,
> >                                                    TopItems.Estimator<Long>
> > estimator) throws TasteException {
> >     DataModel model = getDataModel();
> >     FastIDSet allItemIDs = new FastIDSet(model.getNumItems());
> >     LongPrimitiveIterator it = model.getItemIDs();
> >
> >
> >     while (it.hasNext()) {
> >       allItemIDs.add(it.nextLong());
> >     }
> >     allItemIDs.remove(itemID);
> >     return TopItems.getTopItems(howMany, allItemIDs.iterator(), null,
> > estimator);
> >   }
> >
> >
> >
> >
> > I updated and use it that way:
> >  private List<RecommendedItem> doMostSimilarItems(long itemID,
> >                                                    int howMany,
> >                                                    TopItems.Estimator<Long>
> > estimator) throws TasteException {
> >     DataModel model = getDataModel();
> >
> >       FastIDSet set=new FastIDSet();
> >       PreferenceArray arr=model.getPreferencesForItem(itemID);
> >       for(int i=0;i<arr.length();i++){
> >           set.addAll(model.getItemIDsFromUser(arr.get(i).getUserID()));
> >       }
> >       set.remove(itemID);
> >       return TopItems.getTopItems(howMany,set.iterator(),null,estimator);
> >   }
> >
> >
> >
> > The only difference between two function is:
> > the original one passes all items to getTopItems
> > mine passes only the items that have at least one user who've rated both
> > active item and that item.
> >
> >
> >
> > This little change made the algorithm pretty faster
> > (For my data set it runs 4 times faster now.)
> >
> > I wanted to inform you, if you want to try and update the code.
> > If for another reason original version of the code is better, please make
> me
> > know.
> >
> >
> >
> >
> >
>



-- 
Ted Dunning, CTO
DeepDyve