Posted to dev@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2010/06/01 00:42:22 UTC

Re: Taste - datamodel

+1 from me on this: anywhere we use JDBC / SQL, we could be using
a noSQL data store more scalably, because I don't *think* we rely on any
aggregate fancy SQL joining or grouping or anything.

One thing I wonder, Sean, is if you used say, Voldemort to store rows
of the ItemSimilarity matrix and the user-item preferences matrix,
computing recommendations for an item-based recommender on the
fly by doing a get() based on all the items a user had rated and then
a multiGet() based on all the keys returned, then recommending in
the usual fashion... it's two remote calls to a key-value store during
the course of producing recommendations, but something like Voldemort
or Cassandra would have their responses available in basically no time
(it's all in memory usually), and from experience, the latency you'd
get would be pretty reasonable.

Seems like it would be a nice integration to try out.  The advantage of
that kind of recommender (on the fly) is that you could change the way
you compute recommendations (i.e. the model) on a per-query basis
if need be, and if new items were added to the user's list of things
they'd rated, that user's row in the key-value store could be updated
on the fly too (the ItemSimilarity matrix would drift out of date, sure,
but it could be batch updated periodically).

What do you guys think?

  -jake
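
[Editorial sketch] The two-round-trip flow Jake describes could look roughly like the following. Plain Maps stand in for the remote key-value store (a Voldemort or Cassandra client's get()/multiGet() would replace the lookups); all class and method names here are illustrative, not Mahout or Voldemort APIs.

```java
import java.util.*;

/**
 * Sketch of the two-round-trip flow: one get() for the user's preference
 * row, one multiGet() for the similarity rows of the items the user rated,
 * then a weighted average over the fetched rows.  Plain Maps stand in for
 * the remote key-value store; none of these names are real Mahout APIs.
 */
public class OnTheFlyItemRecommender {

  /**
   * @param prefs   the user's row: item -> rating (one remote get())
   * @param simRows similarity rows for the rated items: item -> (item -> sim)
   *                (one remote multiGet() over prefs.keySet())
   */
  public static List<Long> recommend(Map<Long, Double> prefs,
                                     Map<Long, Map<Long, Double>> simRows,
                                     int howMany) {
    Map<Long, Double> scores = new HashMap<>();
    Map<Long, Double> norms = new HashMap<>();
    for (Map.Entry<Long, Double> pref : prefs.entrySet()) {
      Map<Long, Double> row = simRows.get(pref.getKey());
      if (row == null) continue;
      for (Map.Entry<Long, Double> sim : row.entrySet()) {
        long candidate = sim.getKey();
        if (prefs.containsKey(candidate)) continue;  // skip already-rated items
        scores.merge(candidate, sim.getValue() * pref.getValue(), Double::sum);
        norms.merge(candidate, Math.abs(sim.getValue()), Double::sum);
      }
    }
    // rank candidates by similarity-weighted average rating
    List<Long> ranked = new ArrayList<>(scores.keySet());
    ranked.sort((a, b) -> Double.compare(scores.get(b) / norms.get(b),
                                         scores.get(a) / norms.get(a)));
    return ranked.subList(0, Math.min(howMany, ranked.size()));
  }

  public static void main(String[] args) {
    Map<Long, Double> prefs = Map.of(10L, 5.0, 11L, 3.0);
    Map<Long, Map<Long, Double>> sims = Map.of(
        10L, Map.of(11L, 0.9, 12L, 0.8),
        11L, Map.of(10L, 0.9, 13L, 0.7));
    System.out.println(recommend(prefs, sims, 2));  // [12, 13]
  }
}
```

Updating a user's row on the fly is then just a put() back to the same store, which is what makes the per-query flexibility Jake mentions possible.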


On Mon, May 31, 2010 at 10:08 AM, Ted Dunning <te...@gmail.com> wrote:

> Yes.  This is clearly feasible.  Everywhere jdbc is used, noSQL could be
> used as well, perhaps to substantial advantage.
>
> On Mon, May 31, 2010 at 9:57 AM, Florent Empis <florent.empis@gmail.com
> >wrote:
>
> > org.apache.mahout.cf.taste.impl.recommender.slopeone.jdbc.*
> > and
> > org.apache.mahout.cf.taste.impl.similarity.jdbc.*
> >
> > The data structure used by these classes is very simple, hence I thought
> it
> > might make sense to store them in a process with less other overhead than
> a
> > full blown RDBMS.
> > The advantage of using a distributed key-value store for these seemed
> > obvious to me: several nodes providing resilience, and allowing for
> > scalability on the querying side of things...
> >
>

Re: Taste - datamodel

Posted by Ted Dunning <te...@gmail.com>.
On Mon, May 31, 2010 at 3:42 PM, Jake Mannix <ja...@gmail.com> wrote:

> Voldemort to store rows
> of the ItemSimilarity matrix and the user-item preferences matrix,
> computing recommendations for an item-based recommender on the
> fly by doing a get() based on all the items a user had rated and then
> a multiGet() based on all the keys returned, ...
>
...
> What do you guys think?
>

Been there.  Done that.  Works great.  In fact, for quite a while, we just
used lighttpd to serve these vectors and didn't even have a multi-get.
That still worked well.

I agree with Sean that SQL shouldn't be dropped ... this is just an
addition.

Re: Taste - datamodel

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 31, 2010 at 4:01 PM, Sean Owen <sr...@gmail.com> wrote:
>
>
> So I don't know that SQL should go away, if only because, for every
> business with Cassandra set up there are 1000 with a database. And for
> every user with huge data, there are 100 with medium-sized data. I
> don't want to lose sight of supporting the common case on the way.
>
> One larger point I'm making, which I don't believe anyone's disputing,
> is that there are more than just huge item-based recommenders in play
> here. What's below makes total sense, for one algorithm, though.
>

I'm certainly not recommending we _drop_ SQL!  Just add support for
noSQL.  Maybe there's a problem with that moniker? :)


> Yes this is more or less how the item-based recommender works now,
> when paired up with a JDBC-backed ItemSimilarity. (It could use a
> "bulk query" method to get all those similarities in one go though;
> not too hard to weave in.)
>
> I'd be a little concerned about whether this fits comfortably in
> memory. The similarity matrix is potentially dense -- big rows -- and
> you're loading one row per item the user has rated. It could get into
> tens of megabytes for one query. The distributed version dares not do
> this. But, worth a try in principle.
>

Nonono, we'd definitely have to make an approximation here: trim down
the ItemSimilarity matrix to be more sparse, e.g. by removing
similarities in a row that are below a certain cutoff (as long as
enough entries remain in that row: if a row only has, say, 10 nonzero
similarities, keep them all).  That way any call to the data store to
get all of the similarities for a given set of items (all of the
closest items with their similarities, forgetting about the ones
farther away) only returns, like Ted suggested, 100 or
1000 * numItemsThisUserRated entries.  Should be retrieving well under
100KB at a time, hopefully.
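
[Editorial sketch] The trimming rule Jake describes — drop below a cutoff, but never shrink a row past a minimum size — might look like this. The class and parameter names are illustrative, not part of Mahout:

```java
import java.util.*;

/**
 * Sketch of the row-trimming approximation described above: drop
 * similarities below a cutoff, but always keep at least minKeep entries
 * (the strongest ones by magnitude) so that short rows survive intact.
 */
public class SimilarityRowTrimmer {

  public static Map<Long, Double> trim(Map<Long, Double> row,
                                       double cutoff, int minKeep) {
    List<Map.Entry<Long, Double>> entries = new ArrayList<>(row.entrySet());
    // strongest similarities first
    entries.sort((a, b) -> Double.compare(Math.abs(b.getValue()),
                                          Math.abs(a.getValue())));
    Map<Long, Double> trimmed = new LinkedHashMap<>();
    for (Map.Entry<Long, Double> e : entries) {
      // keep an entry if the row is still below minKeep, or it clears the cutoff
      if (trimmed.size() < minKeep || Math.abs(e.getValue()) >= cutoff) {
        trimmed.put(e.getKey(), e.getValue());
      }
    }
    return trimmed;
  }
}
```

Rows with fewer than minKeep entries come back untouched, matching the "if that row only has 10 nonzero similarities, keep them all" case.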


> > Seems like it would be a nice integration to try out.  The advantage of
> > that kind of recommender (on the fly) is that you could change the way
> > you compute recommendations (i.e. the model) on a per-query basis
> > if need be, and if new items were added to the user's list of things
> > they'd rated, that user's row in the key-value store could be updated
> > on the fly too (the ItemSimilarity matrix would drift out of date, sure,
> > but it could be batch updated periodically).
>
> +1 for someone to try it. I'm currently more interested in the Hadoop
> end of the spectrum myself.
>

Yep, I hear ya.  I'm just trying to chime in with a nice "+1" to encourage
any patches which allow for this kind of thing.  Someday I'll have the
time to look at these kinds of fun additions myself, but it isn't quite now.

  -jake

Re: Taste - datamodel

Posted by Sean Owen <sr...@gmail.com>.
It sounds fine. At the moment the distributed recommender caps the
number of co-occurrences (similarities) recorded at 100 per item. I
should make that configurable. And it takes the top 10 user
preferences by magnitude when making a recommendation. You might use
the same defaults but also make them configurable.

On Tue, Jun 1, 2010 at 6:56 AM, Sebastian Schelter
<se...@zalando.de> wrote:
> I'm reading this discussion with great interest.
>
> As you stress the importance of keeping the item-similarity-matrix sparse, I
> think it would be a useful improvement to add an option like
> "maxSimilaritiesPerItem" to
> o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob, which would make it
> try to cut down the number of similar items per item.
>
> However, as we store each similarity pair only once, it could happen that
> there are more than "maxSimilaritiesPerItem" similar items for a single item,
> as we can't drop some of the pairs because the other item in the pair might
> have too few similarities otherwise.
>
> I could add this feature if you agree that it's useful this way.
>
> If one wishes to drop similarities below a certain cutoff, this could be
> done in a custom implementation of
> o.a.m.cf.taste.hadoop.similarity.DistributedItemSimilarity by simply
> returning NaN if the computed similarity is below that cutoff value.
>
> -sebastian
>

Re: Taste - datamodel

Posted by Sebastian Schelter <se...@zalando.de>.
I'm reading this discussion with great interest.

As you stress the importance of keeping the item-similarity-matrix sparse, I
think it would be a useful improvement to add an option like
"maxSimilaritiesPerItem" to
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob, which would make it
try to cut down the number of similar items per item.

However, as we store each similarity pair only once, it could happen that
there are more than "maxSimilaritiesPerItem" similar items for a single item,
as we can't drop some of the pairs because the other item in the pair might
have too few similarities otherwise.

I could add this feature if you agree that it's useful this way.

If one wishes to drop similarities below a certain cutoff, this could be
done in a custom implementation of
o.a.m.cf.taste.hadoop.similarity.DistributedItemSimilarity by simply
returning NaN if the computed similarity is below that cutoff value.

-sebastian
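
[Editorial sketch] Sebastian's NaN-cutoff idea can be illustrated as a wrapper around any similarity function. The interface below is a simplified stand-in, not Mahout's actual DistributedItemSimilarity; the point is only that returning NaN below the cutoff signals downstream code to drop the pair:

```java
/**
 * Sketch of the NaN-cutoff idea.  The SimilarityFunction interface is a
 * simplified stand-in, not Mahout's DistributedItemSimilarity: wrapping
 * a similarity so that values below the cutoff become NaN lets the
 * downstream job discard those pairs.
 */
public class CutoffSimilarity {

  public interface SimilarityFunction {
    double similarity(double[] a, double[] b);
  }

  /** Wraps a delegate so that similarities below the cutoff become NaN. */
  public static SimilarityFunction withCutoff(SimilarityFunction delegate,
                                              double cutoff) {
    return (a, b) -> {
      double s = delegate.similarity(a, b);
      return s < cutoff ? Double.NaN : s;  // NaN = "discard this pair"
    };
  }

  /** Plain cosine similarity, used here only as a demo delegate. */
  public static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}
```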

2010/6/1 Ted Dunning <te...@gmail.com>

> I normally deal with this by purposefully limiting the length of these
> rows.
>  The argument is that if I never recommend more than 100 items to a person
> (or 20 or 1000 ... the argument doesn't change), then none of the item ->
> item* mappings needs to have more than 100 items since the tail of the list
> can't affect the top 100 recommendations anyway.  It is also useful to
> limit
> the user history to either only recent or only important ratings.  That
> means that a typical big multi-get is something like 100 history items x
> 100
> related items = 10,000 items x 10 bytes for id+score.  This sounds kind of
> big, but the average case is 5x smaller.
>
> On Mon, May 31, 2010 at 4:01 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > I'd be a little concerned about whether this fits comfortably in
> > memory. The similarity matrix is potentially dense -- big rows -- and
> > you're loading one row per item the user has rated. It could get into
> > tens of megabytes for one query. The distributed version dares not do
> > this. But, worth a try in principle.
> >
>

Re: Taste - datamodel

Posted by Ted Dunning <te...@gmail.com>.
I normally deal with this by purposefully limiting the length of these rows.
 The argument is that if I never recommend more than 100 items to a person
(or 20 or 1000 ... the argument doesn't change), then none of the item ->
item* mappings needs to have more than 100 items since the tail of the list
can't affect the top 100 recommendations anyway.  It is also useful to limit
the user history to either only recent or only important ratings.  That
means that a typical big multi-get is something like 100 history items x 100
related items = 10,000 items x 10 bytes for id+score.  This sounds kind of
big, but the average case is 5x smaller.
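
[Editorial note] Ted's back-of-envelope estimate works out as follows; the helper just spells out the arithmetic, using the numbers from his message:

```java
/**
 * Back-of-envelope payload size for the big multi-get described above:
 * historyItems similarity rows, each capped at relatedPerItem entries,
 * at roughly bytesPerEntry for an id + score.
 */
public class MultiGetPayload {

  public static long worstCaseBytes(int historyItems, int relatedPerItem,
                                    int bytesPerEntry) {
    return (long) historyItems * relatedPerItem * bytesPerEntry;
  }

  public static void main(String[] args) {
    // 100 history items x 100 related items x 10 bytes = 100,000 bytes,
    // i.e. ~100KB worst case; the average case is about 5x smaller, ~20KB.
    System.out.println(worstCaseBytes(100, 100, 10));  // 100000
  }
}
```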

On Mon, May 31, 2010 at 4:01 PM, Sean Owen <sr...@gmail.com> wrote:

> I'd be a little concerned about whether this fits comfortably in
> memory. The similarity matrix is potentially dense -- big rows -- and
> you're loading one row per item the user has rated. It could get into
> tens of megabytes for one query. The distributed version dares not do
> this. But, worth a try in principle.
>

Re: Taste - datamodel

Posted by Sean Owen <sr...@gmail.com>.
On Mon, May 31, 2010 at 11:42 PM, Jake Mannix <ja...@gmail.com> wrote:
> +1 from me on this: anywhere we use JDBC / SQL, we could be using
> a noSQL data store more scalably, because I don't *think* we rely on any
> aggregate fancy SQL joining or grouping or anything.

It's not too complicated, except for slope-one, where it's pretty
useful to leverage SQL.

So I don't know that SQL should go away, if only because, for every
business with Cassandra set up there are 1000 with a database. And for
every user with huge data, there are 100 with medium-sized data. I
don't want to lose sight of supporting the common case on the way.

One larger point I'm making, which I don't believe anyone's disputing,
is that there are more than just huge item-based recommenders in play
here. What's below makes total sense, for one algorithm, though.


> One thing I wonder, Sean, is if you used say, Voldemort to store rows
> of the ItemSimilarity matrix and the user-item preferences matrix,
> computing recommendations for an item-based recommender on the
> fly by doing a get() based on all the items a user had rated and then
> a multiGet() based on all the keys returned, then recommending in
> the usual fashion... it's two remote calls to a key-value store during
> the course of producing recommendations, but something like Voldemort
> or Cassandra would have their responses available in basically no time
> (it's all in memory usually), and from experience, the latency you'd
> get would be pretty reasonable.

Yes this is more or less how the item-based recommender works now,
when paired up with a JDBC-backed ItemSimilarity. (It could use a
"bulk query" method to get all those similarities in one go though;
not too hard to weave in.)

I'd be a little concerned about whether this fits comfortably in
memory. The similarity matrix is potentially dense -- big rows -- and
you're loading one row per item the user has rated. It could get into
tens of megabytes for one query. The distributed version dares not do
this. But, worth a try in principle.


> Seems like it would be a nice integration to try out.  The advantage of
> that kind of recommender (on the fly) is that you could change the way
> you compute recommendations (i.e. the model) on a per-query basis
> if need be, and if new items were added to the user's list of things
> they'd rated, that user's row in the key-value store could be updated
> on the fly too (the ItemSimilarity matrix would drift out of date, sure,
> but it could be batch updated periodically).

+1 for someone to try it. I'm currently more interested in the Hadoop
end of the spectrum myself.