Posted to dev@mahout.apache.org by Florent Empis <fl...@gmail.com> on 2010/05/31 17:23:49 UTC

Re: Taste - datamodel

Hi,
Wouldn't it make sense for Taste to be able to use a DataModel backed by a
NoSQL store of some sort?
I was thinking of Cassandra since:

   - Output of a Hadoop job can be inserted into Cassandra
   - It's classified as an eventually consistent key-value store by the
     Wikipedia NoSQL page
   - It's an Apache project


I am by no means a NoSQL expert, but I figured it could be a nice project
for me to learn a few things about Taste, and about Cassandra.
The idea I have is basically:

Recommendation queries (via, say, TasteWeb) are answered via a
Cassandra-based DataModel for recommendation.
Regular updates are computed (at a frequency that depends on the volume of
data and the available processing power) and pushed into Cassandra.

Since immediate consistency does not matter (for a given item/user,
recommendations won't drastically change between two runs of the
recommendation algorithm; if they do change, the last proposal probably
wasn't "wrong" anyway, the new one is just "better"), an eventually
consistent datastore seems a good choice: content replication is built in,
which gives the query system strong resilience...
Let me know if you think it would add some value to Taste....

BR,
Florent

Re: Taste - datamodel

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, May 31, 2010 at 4:01 PM, Sean Owen <sr...@gmail.com> wrote:
>
>
> So I don't know that SQL should go away, if only because, for every
> business with Cassandra set up there are 1000 with a database. And for
> every user with huge data, there are 100 with medium-sized data. I
> don't want to lose sight of supporting the common case on the way.
>
> One larger point I'm making, which I don't believe anyone's disputing,
> is that there are more than just huge item-based recommenders in play
> here. What's below makes total sense, for one algorithm, though.
>

I'm certainly not recommending we _drop_ SQL!  Just add support for
noSQL.  Maybe there's a problem with that moniker? :)


> Yes this is more or less how the item-based recommender works now,
> when paired up with a JDBC-backed ItemSimilarity. (It could use a
> "bulk query" method to get all those similarities in one go though;
> not too hard to weave in.)
>
> I'd be a little concerned about whether this fits comfortably in
> memory. The similarity matrix is potentially dense -- big rows -- and
> you're loading one row per item the user has rated. It could get into
> tens of megabytes for one query. The distributed version dares not do
> this. But, worth a try in principle.
>

No no no, we'd definitely have to make an approximation here: trim the
ItemSimilarity matrix down to be more sparse, e.g. by removing
similarities in a row that fall below a certain cutoff (as long as there
are enough entries left in that row: if a row only has, say, 10 nonzero
similarities, keep them all).  That way any call to the data store to get
all of the similarities (all of the closest items with their similarities,
forgetting about the ones farther away) for a given set of items
only returns, like Ted suggested, 100 or 1000 * numItemsThisUserRated
entries.  Should be retrieving well under 100KB at a time, hopefully.
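
In code, the trimming could look something like this (an illustrative Java
sketch only -- RowPruner and SimilarityEntry are made-up names, not anything
in Mahout):

```java
import java.util.*;

// Sketch of the pruning described above: drop similarities in a row that
// fall below a cutoff, but never shrink a row below a minimum size.
public class RowPruner {

    // one (itemID, similarity) cell of a similarity-matrix row
    static final class SimilarityEntry {
        final long itemID;
        final double similarity;
        SimilarityEntry(long itemID, double similarity) {
            this.itemID = itemID;
            this.similarity = similarity;
        }
    }

    static List<SimilarityEntry> pruneRow(List<SimilarityEntry> row,
                                          double cutoff, int minEntries) {
        if (row.size() <= minEntries) {
            return new ArrayList<>(row); // too small to prune at all
        }
        // sort descending by similarity so the strongest entries survive
        List<SimilarityEntry> sorted = new ArrayList<>(row);
        sorted.sort((a, b) -> Double.compare(b.similarity, a.similarity));
        List<SimilarityEntry> kept = new ArrayList<>();
        for (SimilarityEntry e : sorted) {
            // keep everything above the cutoff, and keep weaker entries
            // only until the minimum row size is reached
            if (e.similarity >= cutoff || kept.size() < minEntries) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```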


> > Seems like it would be a nice integration to try out.  The advantage of
> > that kind of recommender (on the fly) is that you could change the way
> > you compute recommendations (i.e. the model) on a per-query basis
> > if need be, and if new items were added to the user's list of things
> > they'd rated, that user's row in the key-value store could be updated
> > on the fly too (the ItemSimilarity matrix would drift out of date, sure,
> > but it could be batch updated periodically).
>
> +1 for someone to try it. I'm currently more interested in the Hadoop
> end of the spectrum myself.
>

Yep, I hear ya.  I'm just trying to chime in with a nice "+1" to encouraging
any patches which allow for this kind of thing.  Someday I'll have the
time to look at these kinds of fun additions myself, but it isn't quite now.

  -jake

Re: Taste - datamodel

Posted by Sean Owen <sr...@gmail.com>.
It sounds fine. At the moment the distributed recommender caps the
number of co-occurrences (similarities) recorded at 100 per item. I
should make that configurable. And it takes the top 10 user
preferences by magnitude when making a recommendation. You might use
the same defaults but also make them configurable.

On Tue, Jun 1, 2010 at 6:56 AM, Sebastian Schelter
<se...@zalando.de> wrote:
> I'm reading this discussion with great interest.
>
> As you stress the importance of keeping the item-similarity-matrix sparse, I
> think it would be a useful improvement to add an option like
> "maxSimilaritiesPerItem" to
> o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob, which would make it
> try to cut down the number of similar items per item.
>
> However, as we store each similarity pair only once, it could happen that
> there are more than "maxSimilaritiesPerItem" similar items for a single
> item: we can't drop some of the pairs, because the other item in the pair
> might be left with too few similarities otherwise.
>
> I could add this feature if you agree that it's useful this way.
>
> If one wishes to drop similarities below a certain cutoff, this could be
> done in a custom implementation of
> o.a.m.cf.taste.hadoop.similarity.DistributedItemSimilarity by simply
> returning NaN if the computed similarity is below that cutoff value.
>
> -sebastian
>

Re: Taste - datamodel

Posted by Sebastian Schelter <se...@zalando.de>.
I'm reading this discussion with great interest.

As you stress the importance of keeping the item-similarity-matrix sparse, I
think it would be a useful improvement to add an option like
"maxSimilaritiesPerItem" to
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob, which would make it
try to cut down the number of similar items per item.

However, as we store each similarity pair only once, it could happen that
there are more than "maxSimilaritiesPerItem" similar items for a single
item: we can't drop some of the pairs, because the other item in the pair
might be left with too few similarities otherwise.

I could add this feature if you agree that it's useful this way.

If one wishes to drop similarities below a certain cutoff, this could be
done in a custom implementation of
o.a.m.cf.taste.hadoop.similarity.DistributedItemSimilarity by simply
returning NaN if the computed similarity is below that cutoff value.
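
As a rough illustration of the NaN trick (the interface below is a stand-in,
not the real o.a.m.cf.taste.hadoop.similarity.DistributedItemSimilarity
signature):

```java
// Sketch of a thresholding wrapper: anything below the cutoff is reported
// as NaN so the job drops the pair. Names here are illustrative only.
public class ThresholdingSimilarity {

    interface Similarity { // stand-in for the Mahout similarity interface
        double similarity(long itemA, long itemB);
    }

    static Similarity withCutoff(Similarity delegate, double cutoff) {
        return (a, b) -> {
            double s = delegate.similarity(a, b);
            return s < cutoff ? Double.NaN : s; // NaN == "no similarity"
        };
    }
}
```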

-sebastian

2010/6/1 Ted Dunning <te...@gmail.com>

> I normally deal with this by purposefully limiting the length of these
> rows.  The argument is that if I never recommend more than 100 items to a
> person (or 20 or 1000 ... the argument doesn't change), then none of the
> item -> item* mappings needs to have more than 100 items since the tail of
> the list can't affect the top 100 recommendations anyway.  It is also
> useful to limit the user history to either only recent or only important
> ratings.  That means that a typical big multi-get is something like 100
> history items x 100 related items = 10,000 items x 10 bytes for id+score.
> This sounds kind of big, but the average case is 5x smaller.
>
> On Mon, May 31, 2010 at 4:01 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > I'd be a little concerned about whether this fits comfortably in
> > memory. The similarity matrix is potentially dense -- big rows -- and
> > you're loading one row per item the user has rated. It could get into
> > tens of megabytes for one query. The distributed version dares not do
> > this. But, worth a try in principle.
> >
>

Re: Taste - datamodel

Posted by Ted Dunning <te...@gmail.com>.
I normally deal with this by purposefully limiting the length of these rows.
The argument is that if I never recommend more than 100 items to a person
(or 20 or 1000 ... the argument doesn't change), then none of the item ->
item* mappings needs to have more than 100 items since the tail of the list
can't affect the top 100 recommendations anyway.  It is also useful to limit
the user history to either only recent or only important ratings.  That
means that a typical big multi-get is something like 100 history items x 100
related items = 10,000 items x 10 bytes for id+score.  This sounds kind of
big, but the average case is 5x smaller.
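
Spelled out as a tiny sketch (the 10 bytes per id+score is the rough
estimate above, not a measured figure):

```java
// Back-of-envelope payload size for one big multi-get:
// history items x related items per row x bytes per (id, score) entry.
public class PayloadEstimate {
    static long multiGetBytes(int historyItems, int relatedPerItem, int bytesPerEntry) {
        return (long) historyItems * relatedPerItem * bytesPerEntry;
    }
}
```

So the worst case is 100 x 100 x 10 = 100,000 bytes, i.e. about 100KB per
query, and the 5x-smaller average case lands around 20KB.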

On Mon, May 31, 2010 at 4:01 PM, Sean Owen <sr...@gmail.com> wrote:

> I'd be a little concerned about whether this fits comfortably in
> memory. The similarity matrix is potentially dense -- big rows -- and
> you're loading one row per item the user has rated. It could get into
> tens of megabytes for one query. The distributed version dares not do
> this. But, worth a try in principle.
>

Re: Taste - datamodel

Posted by Sean Owen <sr...@gmail.com>.
On Mon, May 31, 2010 at 11:42 PM, Jake Mannix <ja...@gmail.com> wrote:
> +1 from me on this: anywhere we use JDBC / SQL, we could be using
> a noSQL data store more scalably, because I don't *think* we rely on any
> fancy aggregate SQL joining or grouping or anything.

It's not too complicated, except for slope-one, where it's pretty
useful to leverage SQL.

So I don't know that SQL should go away, if only because, for every
business with Cassandra set up there are 1000 with a database. And for
every user with huge data, there are 100 with medium-sized data. I
don't want to lose sight of supporting the common case on the way.

One larger point I'm making, which I don't believe anyone's disputing,
is that there are more than just huge item-based recommenders in play
here. What's below makes total sense, for one algorithm, though.


> One thing I wonder, Sean, is if you used say, Voldemort to store rows
> of the ItemSimilarity matrix and the user-item preferences matrix,
> computing recommendations for an item-based recommender on the
> fly by doing a get() based on all the items a user had rated and then
> a multiGet() based on all the keys returned, then recommending in
> the usual fashion... it's two remote calls to a key-value store during
> the course of producing recommendations, but something like Voldemort
> or Cassandra would have their responses available in basically no time
> (it's all in memory usually), and from experience, the latency you'd
> get would be pretty reasonable.

Yes this is more or less how the item-based recommender works now,
when paired up with a JDBC-backed ItemSimilarity. (It could use a
"bulk query" method to get all those similarities in one go though;
not too hard to weave in.)

I'd be a little concerned about whether this fits comfortably in
memory. The similarity matrix is potentially dense -- big rows -- and
you're loading one row per item the user has rated. It could get into
tens of megabytes for one query. The distributed version dares not do
this. But, worth a try in principle.


> Seems like it would be a nice integration to try out.  The advantage of
> that kind of recommender (on the fly) is that you could change the way
> you compute recommendations (i.e. the model) on a per-query basis
> if need be, and if new items were added to the user's list of things
> they'd rated, that user's row in the key-value store could be updated
> on the fly too (the ItemSimilarity matrix would drift out of date, sure,
> but it could be batch updated periodically).

+1 for someone to try it. I'm currently more interested in the Hadoop
end of the spectrum myself.

Re: Taste - datamodel

Posted by Ted Dunning <te...@gmail.com>.
On Mon, May 31, 2010 at 3:42 PM, Jake Mannix <ja...@gmail.com> wrote:

> Voldemort to store rows
> of the ItemSimilarity matrix and the user-item preferences matrix,
> computing recommendations for an item-based recommender on the
> fly by doing a get() based on all the items a user had rated and then
> a multiGet() based on all the keys returned, ...
>
...
> What do you guys think?
>

Been there.  Done that.  Works great.  In fact, for quite a while, we just
used lighttpd to serve these vectors and didn't even have a multi-get.
That still worked well.

I agree with Sean that SQL shouldn't be dropped ... this is just an
addition.

Re: Taste - datamodel

Posted by Jake Mannix <ja...@gmail.com>.
+1 from me on this: anywhere we use JDBC / SQL, we could be using
a noSQL data store more scalably, because I don't *think* we rely on any
fancy aggregate SQL joining or grouping or anything.

One thing I wonder, Sean, is if you used say, Voldemort to store rows
of the ItemSimilarity matrix and the user-item preferences matrix,
computing recommendations for an item-based recommender on the
fly by doing a get() based on all the items a user had rated and then
a multiGet() based on all the keys returned, then recommending in
the usual fashion... it's two remote calls to a key-value store during
the course of producing recommendations, but something like Voldemort
or Cassandra would have their responses available in basically no time
(it's all in memory usually), and from experience, the latency you'd
get would be pretty reasonable.
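
A toy sketch of that get() + multiGet() flow, with plain in-memory Maps
standing in for the key-value store (every name here is invented for
illustration, not Voldemort's or Mahout's actual APIs):

```java
import java.util.*;

// Item-based recommendation in two conceptual kv-store round-trips:
// 1) get() the user's rated items, 2) multiGet() the similarity row of
// each rated item, then score and rank as usual.
public class KvItemBasedRecommender {

    static List<Long> recommend(Map<Long, List<Long>> userPrefs,
                                Map<Long, Map<Long, Double>> similarityRows,
                                long userID, int howMany) {
        // round-trip 1: the user's rated items
        List<Long> rated = userPrefs.getOrDefault(userID, List.of());
        // round-trip 2: similarity rows for all rated items
        Map<Long, Double> scores = new HashMap<>();
        for (long item : rated) {
            Map<Long, Double> row = similarityRows.getOrDefault(item, Map.of());
            for (Map.Entry<Long, Double> e : row.entrySet()) {
                if (!rated.contains(e.getKey())) { // don't re-recommend rated items
                    scores.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        // rank candidates by accumulated similarity, keep the top howMany
        return scores.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .limit(howMany)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```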

Seems like it would be a nice integration to try out.  The advantage of
that kind of recommender (on the fly) is that you could change the way
you compute recommendations (i.e. the model) on a per-query basis
if need be, and if new items were added to the user's list of things
they'd rated, that user's row in the key-value store could be updated
on the fly too (the ItemSimilarity matrix would drift out of date, sure,
but it could be batch updated periodically).

What do you guys think?

  -jake


On Mon, May 31, 2010 at 10:08 AM, Ted Dunning <te...@gmail.com> wrote:

> Yes.  This is clearly feasible.  Everywhere jdbc is used, noSQL could be
> used as well, perhaps to substantial advantage.
>
> On Mon, May 31, 2010 at 9:57 AM, Florent Empis <florent.empis@gmail.com
> >wrote:
>
> > org.apache.mahout.cf.taste.impl.recommender.slopeone.jdbc.*
> > and
> > org.apache.mahout.cf.taste.impl.similarity.jdbc.*
> >
> > The data structure used by these classes is very simple, hence I thought
> > it might make sense to store them in a process with less overhead than a
> > full-blown RDBMS.
> > The advantage of using a distributed key-value system for these seemed
> > obvious to me: several nodes providing resilience, and allowing for
> > scalability on the querying side of things....
> >
>

Re: Taste - datamodel

Posted by Ted Dunning <te...@gmail.com>.
Yes.  This is clearly feasible.  Everywhere jdbc is used, noSQL could be
used as well, perhaps to substantial advantage.

On Mon, May 31, 2010 at 9:57 AM, Florent Empis <fl...@gmail.com>wrote:

> org.apache.mahout.cf.taste.impl.recommender.slopeone.jdbc.*
> and
> org.apache.mahout.cf.taste.impl.similarity.jdbc.*
>
> The data structure used by these classes is very simple, hence I thought it
> might make sense to store them in a process with less overhead than a
> full-blown RDBMS.
> The advantage of using a distributed key-value system for these seemed
> obvious to me: several nodes providing resilience, and allowing for
> scalability on the querying side of things....
>

Re: Taste - datamodel

Posted by Sean Owen <sr...@gmail.com>.
Yes, precisely, those are the bits that could plausibly rely on
Cassandra, and not DataModel. I should also have clarified this.

I plead ignorance; my knowledge of Cassandra only comes from knowing
how BigTable works in Google. I don't know how to develop against it
appropriately so would welcome anyone with a patch.

On Mon, May 31, 2010 at 5:57 PM, Florent Empis <fl...@gmail.com> wrote:
> The data structure used by these classes is very simple, hence I thought it
> might make sense to store them in a process with less overhead than a
> full-blown RDBMS.
> The advantage of using a distributed key-value system for these seemed
> obvious to me: several nodes providing resilience, and allowing for
> scalability on the querying side of things....

Re: Taste - datamodel

Posted by Florent Empis <fl...@gmail.com>.
I'm afraid I might have been a bit unclear (the different layers of Mahout
are still very new to me, my apologies). What I had in mind was basically to
provide a replacement for, say,
org.apache.mahout.cf.taste.impl.recommender.slopeone.jdbc.*
and
org.apache.mahout.cf.taste.impl.similarity.jdbc.*

The data structure used by these classes is very simple, hence I thought it
might make sense to store them in a process with less overhead than a
full-blown RDBMS.
The advantage of using a distributed key-value system for these seemed
obvious to me: several nodes providing resilience, and allowing for
scalability on the querying side of things....


2010/5/31 Ted Dunning <te...@gmail.com>

> Hmm....
>
> I have used Lucene very effectively in item-based recommendation settings
> before and user-based recommendations were marginally acceptable.
>
> All that the data store has to do is fetch the lists of related items for
> the most important few items in the person's history.  With a multi-fetch
> operation (which Cassandra may or may not have), this is one server
> round-trip.  It is definitely much faster to keep lots of items in memory,
> though.
>
> The off-line processing to build the item-item relationships, however,
> would require a scan of all user profiles, which may be a bit intense,
> especially if the NOSQL store is being used at the same time to service
> user requests.  I have found it preferable to grovel some form of log file
> in HDFS to get this information in the past.
>
> On Mon, May 31, 2010 at 9:00 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > So, any such data store is way too slow to use with a real-time
> > recommender.
> >
> > But a distributed algorithm? sure. As you say, the distributed version
> > runs on Hadoop, and you can transfer between HDFS and Cassandra. Not
> > sure whether to call that integration -- there's nothing the project
> > would meaningfully do here, since it reads off HDFS.
> >
>

Re: Taste - datamodel

Posted by Sean Owen <sr...@gmail.com>.
I suppose I'm referring to the collection of algorithms as implemented
in Mahout. The online / non-distributed algorithms, as a group, have
collective assumptions and requirements that don't fit something like
Cassandra.

You're describing a perfectly valid variant which is partly offline.
You can support this with a JDBCItemSimilarity or
GenericItemSimilarity after pre-computing item-item similarity
somehow, offline. (This is what Sebastian has been doing -- that part
is distributable and is already implemented in Mahout as a Hadoop
job.)

This works fine then with slow-ish data stores like an RDBMS or
something like Cassandra. But this trick won't work for all
algorithms, in general.

So in that sense, yes, you can easily shim in Cassandra here by
implementing an ItemSimilarity. OK yes, that'd be a particularly
non-trivial and interesting integration which, if someone wrote and
supported it, and perhaps wanted to expand it to its logical conclusion,
would be lovely.
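
A minimal sketch of such a precomputed shim, with an in-memory map standing
in for whatever store backs it (illustrative names only, not the real Taste
ItemSimilarity interface):

```java
import java.util.*;

// Lookup table of precomputed item-item similarities, loaded from some
// offline job's output (Cassandra, RDBMS, HDFS...). Pairs are stored once
// but recorded symmetrically so both lookup directions work.
public class PrecomputedSimilarity {

    private final Map<Long, Map<Long, Double>> table = new HashMap<>();

    void put(long itemA, long itemB, double value) {
        table.computeIfAbsent(itemA, k -> new HashMap<>()).put(itemB, value);
        table.computeIfAbsent(itemB, k -> new HashMap<>()).put(itemA, value);
    }

    // NaN mirrors Taste's convention for "no similarity known"
    double itemSimilarity(long itemA, long itemB) {
        return table.getOrDefault(itemA, Map.of()).getOrDefault(itemB, Double.NaN);
    }
}
```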




On Mon, May 31, 2010 at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
> Hmm....
>
> I have used Lucene very effectively in item-based recommendation settings
> before and user-based recommendations were marginally acceptable.
>
> All that the data store has to do is fetch the lists of related items for
> the most important few items in the person's history.  With a multi-fetch
> operation (which Cassandra may or may not have), this is one server
> round-trip.  It is definitely much faster to keep lots of items in memory,
> though.
>
> The off-line processing to build the item-item relationships, however, would
> require a scan of all user profiles, which may be a bit intense, especially
> if the NOSQL store is being used at the same time to service user requests.
> I have found it preferable to grovel some form of log file in HDFS to get
> this information in the past.
>
> On Mon, May 31, 2010 at 9:00 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> So, any such data store is way too slow to use with a real-time
>> recommender.
>>
>> But a distributed algorithm? sure. As you say, the distributed version
>> runs on Hadoop, and you can transfer between HDFS and Cassandra. Not
>> sure whether to call that integration -- there's nothing the project
>> would meaningfully do here, since it reads off HDFS.
>>
>

Re: Taste - datamodel

Posted by Ted Dunning <te...@gmail.com>.
Hmm....

I have used Lucene very effectively in item-based recommendation settings
before and user-based recommendations were marginally acceptable.

All that the data store has to do is fetch the lists of related items for
the most important few items in the person's history.  With a multi-fetch
operation (which Cassandra may or may not have), this is one server
round-trip.  It is definitely much faster to keep lots of items in memory,
though.

The off-line processing to build the item-item relationships, however, would
require a scan of all user profiles, which may be a bit intense, especially
if the NOSQL store is being used at the same time to service user requests.
I have found it preferable to grovel some form of log file in HDFS to get
this information in the past.

On Mon, May 31, 2010 at 9:00 AM, Sean Owen <sr...@gmail.com> wrote:

> So, any such data store is way too slow to use with a real-time
> recommender.
>
> But a distributed algorithm? sure. As you say, the distributed version
> runs on Hadoop, and you can transfer between HDFS and Cassandra. Not
> sure whether to call that integration -- there's nothing the project
> would meaningfully do here, since it reads off HDFS.
>

Re: Taste - datamodel

Posted by Sean Owen <sr...@gmail.com>.
So, any such data store is way too slow to use with a real-time recommender.

But a distributed algorithm? sure. As you say, the distributed version
runs on Hadoop, and you can transfer between HDFS and Cassandra. Not
sure whether to call that integration -- there's nothing the project
would meaningfully do here, since it reads off HDFS.

It could be changed to read from Cassandra somehow, but my sense is
there would be no real benefit. Cassandra is designed to provide
reasonably fast access to huge data, but that's not the model that the
distributed algorithm needs at all. And it's not suitable for the
non-distributed versions (not even an RDBMS really is).

But sure, integrating at arm's length -- pushing output to Cassandra and
pulling it in -- makes sense. It's probably something *Cassandra*
supports better than this project, since it's more a question of
Cassandra importing/exporting.

On Mon, May 31, 2010 at 4:23 PM, Florent Empis <fl...@gmail.com> wrote:
> Hi,
> Wouldn't it make sense for Taste to be able to use a DataModel backed by a
> NoSQL store of some sort?
> I was thinking of Cassandra since:
>
>   - Output of a Hadoop job can be inserted into Cassandra
>   - It's classified as an eventually consistent key-value store by the
>     Wikipedia NoSQL page
>   - It's an Apache project
>
>
> I am by no means a NoSQL expert, but I figured it could be a nice project
> for me to learn a few things about Taste, and about Cassandra.
> The idea I have is basically:
>
> Recommendation queries (via, say, TasteWeb) are answered via a
> Cassandra-based DataModel for recommendation.
> Regular updates are computed (at a frequency that depends on the volume of
> data and the available processing power) and pushed into Cassandra.
>
> Since immediate consistency does not matter (for a given item/user,
> recommendations won't drastically change between two runs of the
> recommendation algorithm; if they do change, the last proposal probably
> wasn't "wrong" anyway, the new one is just "better"), an eventually
> consistent datastore seems a good choice: content replication is built in,
> which gives the query system strong resilience...
> Let me know if you think it would add some value to Taste....
>
> BR,
> Florent
>