Posted to user@spark.apache.org by "m3.sharma" <sh...@umn.edu> on 2014/07/18 00:32:39 UTC

Large scale ranked recommendation

Hi,

I am trying to develop a recommender system for about 1 million users and 10
thousand items. Currently it's a simple regression-based model: for every
(user, item) pair in the dataset we generate some features and learn a model
from them. Training and evaluation are fine; the bottleneck is prediction and
ranking for deployment, since at the end of the day we need to recommend each
user their top 10 personalized items. To do this, for every user I need to use
the model to predict their rating/preference on all items and take the top 10
from the list. Hence, after learning the model, I need to do 10K x 1 million
predictions (model.predict(featureVector)).

Currently I have the following process; feature vectors are sparse and of
length ~300 each.

1. userFeatures: RDD[(Int, Vector)], itemFeatures: RDD[(Int, Vector)]

I take the cartesian product of the above to generate every (user, item)
combination and its corresponding feature vector:

2. val allUIFeat: RDD[(Int, Int, Vector)] =
userFeatures.cartesian(itemFeatures).map(...)

Then I use the model to predict as follows:

3. val allUIPred: RDD[(Int, Int, Double)] = allUIFeat.map { x => (x._1, x._2,
model.predict(x._3)) }

4. Then we group by user and sort to get the top 10 items.
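At toy scale, the per-pair scoring in steps 2-3 above can be sketched as follows (plain NumPy, no Spark; the elementwise-product pairwise feature and the linear model are hypothetical stand-ins, since the original post does not say how the features are generated):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 100, 50, 8   # toy stand-ins for 1M x 10K x ~300
user_feats = rng.normal(size=(n_users, dim))
item_feats = rng.normal(size=(n_items, dim))
weights = rng.normal(size=dim)       # stand-in for the learned regression model

# Steps 2-3: score every (user, item) pair one at a time, as the
# cartesian + map approach does.
scores = np.empty((n_users, n_items))
for u in range(n_users):
    for i in range(n_items):
        feat = user_feats[u] * item_feats[i]   # hypothetical pairwise feature
        scores[u, i] = feat @ weights          # model.predict(featureVector)

# Step 4: top-10 items per user.
top10 = np.argsort(-scores, axis=1)[:, :10]
print(top10.shape)  # (100, 10)
```

The inner double loop is exactly the O(n*m) work the rest of the thread discusses how to avoid or batch.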

We are not able to complete step 3 above: it's taking a really long time
(~5 hours) to get all the predictions, which is very slow considering we
already have the model and it only needs to do a small computation per
prediction. I have tried partitioning userFeatures across 800 partitions
before running the above steps, but it was of no help.

I am using about 100 executors, each with 2 cores and 2 GB of RAM.

Are there any suggestions to make these predictions fast?






--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Large scale ranked recommendation

Posted by xenocyon <ap...@gmail.com>.
(following up a rather old thread:)

Hi Christopher,

I understand how you might use nearest neighbors for item-item
recommendations, but how do you use it for top N items per user?

Thanks!

Apu






Re: Large scale ranked recommendation

Posted by "m3.sharma" <sh...@umn.edu>.
Christopher, searching in the latent factor space rather than computing each
entry of the matrix is a really great idea; the complexity of the problem
drops drastically from the naive O(n*m). Since our data is not that huge, I
will try exact nearest-neighbor search first and fall back to approximate
search if that doesn't work. I will look into Annoy. Thanks.




Re: Large scale ranked recommendation

Posted by Christopher Johnson <ch...@gmail.com>.
If you are performing recommendations via a latent factor model, then I
highly recommend looking into methods of "approximate nearest neighbors".
At Spotify we batch-process top-N recommendations for 40M users with a
catalog of >40M items, but we avoid the naive O(n*m) process you are
describing by performing an approximate nearest neighbors search. There
are a bunch of open source packages you can use, including our own
https://github.com/spotify/annoy, which uses random projections in your
latent factor space to build a forest of trees with constant-time
nearest-neighbor lookup.
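The random-projection idea can be sketched with a single hash table (Annoy builds a whole forest of trees on the same principle; this flat table, with hypothetical sizes, is a simplified illustration, not Annoy's actual algorithm):

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, dim, n_planes = 1000, 32, 6
items = rng.normal(size=(n_items, dim))      # latent item factors

# Random hyperplanes; the sign pattern of a vector against them is its bucket.
planes = rng.normal(size=(n_planes, dim))

def bucket(v):
    # Pack the sign bits into an integer bucket id.
    bits = planes @ v > 0
    return sum(1 << i for i, b in enumerate(bits) if b)

table = {}
for idx in range(n_items):
    table.setdefault(bucket(items[idx]), []).append(idx)

# Query: rank only the items sharing the user's bucket, not all n_items.
user = rng.normal(size=dim)
candidates = table.get(bucket(user), [])
top = sorted(candidates, key=lambda i: -(items[i] @ user))[:10]
print(len(candidates), "candidates out of", n_items)
```

Because nearby vectors tend to fall on the same side of each hyperplane, each query scores only a small candidate set instead of the full catalog, which is what collapses the O(n*m) batch cost.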



Re: Large scale ranked recommendation

Posted by Nick Pentreath <ni...@gmail.com>.
Agree GPUs may be interesting for this kind of massively parallel linear
algebra on reasonable size vectors.

These projects might be of interest in this regard:
https://github.com/BIDData/BIDMach
https://github.com/BIDData/BIDMat
https://github.com/dlwh/gust

Nick




Re: Large scale ranked recommendation

Posted by "m3.sharma" <sh...@umn.edu>.
Thanks Nick, the real-time suggestion is good; we will see if we can add it to
our deployment strategy, and you are correct that we may not need
recommendations for each user.

We will try adding more resources and your suggestion of broadcasting the item
features, as currently they don't seem to be huge.

As both users and items will continue to grow, I think a few GPU nodes will
suffice to serve recommendations quickly after learning the model with Spark.
It would be great to have built-in GPU support in Spark, to leverage the GPU
capability of nodes for performing these flops faster.




Re: Large scale ranked recommendation

Posted by Xiangrui Meng <me...@gmail.com>.
Nick's suggestion is a good approach for your data. The item factors to broadcast should be a few MBs. -Xiangrui
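The size estimate is easy to check with the thread's numbers, taking the ~300-length vectors from the original post as the factor dimension (an upper bound; typical latent ranks are much smaller) and dense double precision:

```python
# Back-of-envelope size of the broadcast item matrix.
n_items, dim, bytes_per_double = 10_000, 300, 8
size_mb = n_items * dim * bytes_per_double / 1e6
print(f"{size_mb:.0f} MB")  # 24 MB
```

Even this upper bound is small enough to ship to every executor once as a broadcast variable.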


Re: Large scale ranked recommendation

Posted by Bertrand Dechoux <de...@gmail.com>.
And you might want to apply clustering first. It is likely that not every
user and item is truly unique.

Bertrand Dechoux



Re: Large scale ranked recommendation

Posted by Nick Pentreath <ni...@gmail.com>.
It is very true that making predictions in batch for all 1 million users
against the 10k items will be quite onerous in terms of computation. I have
run into this issue too in making batch predictions.

Some ideas:

1. Do you really need to generate recommendations for each user in batch?
How are you serving these recommendations? In most cases, you only need to
make recs when a user is actively interacting with your service or product
etc. Doing it all in batch tends to be a big waste of computation resources.

In our system for example we are serving them in real time (as a user
arrives at a web page, say, our customer hits our API for recs), so we only
generate the rec at that time. You can take a look at Oryx for this (
https://github.com/cloudera/oryx) though it does not yet support Spark, you
may be able to save the model into the correct format in HDFS and have Oryx
read the data.

2. If you do need to make the recs in batch, then I would suggest:
(a) because you have few items, I would collect the item vectors and form a
matrix.
(b) broadcast that matrix
(c) do a mapPartitions on the user vectors. Form a user matrix from the
vectors in each partition (maybe create quite a few partitions to make each
user matrix not too big)
(d) call .value on the broadcast item matrix inside the tasks
(e) now for each partition you have the (small) item matrix and a (larger)
user matrix. Do a matrix multiply and you end up with a (U x I) matrix with
the scores for each user in the partition. Because you are using BLAS here,
it will be significantly faster than individually computed dot products
(f) sort the scores for each user and take top K
(g) save or collect and do whatever with the scores

3. in conjunction with (2) you can try throwing more resources at the
problem too

If you access the underlying Breeze vectors (I think the toBreeze method is
private so you may have to re-implement it), you can do all this using
Breeze (e.g. concatenating vectors to make matrices, iterating and whatnot).
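Steps (a) through (g) can be sketched at toy scale, with NumPy standing in for Breeze/BLAS and hypothetical sizes (one call of the function below corresponds to the work done inside a single mapPartitions partition):

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, dim = 200, 16
item_matrix = rng.normal(size=(n_items, dim))    # (a)/(b): collected and broadcast

def top_k_for_partition(user_vectors, item_matrix, k=10):
    # (e) one BLAS matrix multiply scores every user in the partition
    # against every item: (U x dim) @ (dim x I) -> (U x I).
    scores = np.asarray(user_vectors) @ item_matrix.T
    # (f) argpartition finds the top k per row without a full sort...
    top = np.argpartition(-scores, k, axis=1)[:, :k]
    # ...then only those k scores per user are sorted.
    rows = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[rows, top], axis=1)
    return np.take_along_axis(top, order, axis=1)

# (c) simulate one partition of user vectors.
partition = rng.normal(size=(500, dim))
top10 = top_k_for_partition(partition, item_matrix)
print(top10.shape)  # (500, 10)
```

The single matrix multiply amortizes memory traffic across all users in the partition, which is why it beats per-pair dot products by a wide margin.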

Hope that helps

Nick



Re: Large scale ranked recommendation

Posted by "m3.sharma" <sh...@umn.edu>.
Yes, that's what prediction should be doing: taking dot products or applying a
sigmoid function for each (user, item) pair. For 1 million users and 10K items
there are 10 billion pairs.




Re: Large scale ranked recommendation

Posted by Shuo Xiang <sh...@gmail.com>.
Hi,
  Are you suggesting that taking simple vector dot products or applying a
sigmoid function on 10K x 1M data takes 5 hours?
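A back-of-envelope flop count (assuming ~300-dim vectors and one multiply plus one add per dimension per pair, dense) suggests the raw arithmetic alone should indeed be far cheaper than 5 hours:

```python
# 10K items x 1M users, ~300 dims, one multiply-add per dimension.
pairs = 10_000 * 1_000_000
flops = pairs * 300 * 2
print(f"{flops:.1e} flops")  # 6.0e+12
```

At roughly 1 GFLOP/s per core across the ~200 cores mentioned in the original post, 6e12 flops is well under a minute of arithmetic, which points at per-record overhead (object creation, serialization, shuffle) rather than the math itself.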



Re: Large scale ranked recommendation

Posted by "m3.sharma" <sh...@umn.edu>.
We are using the regression models that come with the MLlib package in Spark.


