Posted to user@spark.apache.org by redocpot <ju...@gmail.com> on 2014/06/05 17:46:47 UTC

implicit ALS dataSet

Hi,

According to the paper on which MLlib's ALS is based, the model should take
all user-item preferences
as an input, including those which are not related to any input observation
(zero preference).

My question is:

With all positive observations in hand (similar to an explicit feedback data
set), should I generate all negative observations in order to make implicit
ALS work with the complete data set (positive union negative)?

Actually, we are testing on a data set like:

| user | item | nbPurchase |

nbPurchase is always non-zero, so we have no negative observations. What we
did was generate all possible user-item pairs with zero nbPurchase so as to
have the complete set of pairs, but this operation takes considerable time
and storage.

I just want to make sure: do we have to do that with MLlib's ALS, or does it
already handle it? In the latter case, I could simply pass only the positive
observations, as with the explicit ALS.

Hao.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-tp7067.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: implicit ALS dataSet

Posted by redocpot <ju...@gmail.com>.
Hi, 

The real-world dataset is rather large, so I tested on the MovieLens data set
and found the same results:


| alpha | lambda | rank | top1 | top5 | EPR_in  | EPR_out |
| 40    | 0.001  | 50   | 297  | 559  | 0.05855 | 0.17299 |
| 40    | 0.01   | 50   | 295  | 559  | 0.05854 | 0.17298 |
| 40    | 0.1    | 50   | 296  | 560  | 0.05846 | 0.17287 |
| 40    | 1      | 50   | 309  | 564  | 0.05819 | 0.17227 |
| 40    | 25     | 50   | 287  | 537  | 0.05699 | 0.14855 |
| 40    | 50     | 50   | 267  | 496  | 0.05795 | 0.13389 |
| 40    | 100    | 50   | 247  | 444  | 0.06504 | 0.11920 |
| 40    | 200    | 50   | 145  | 306  | 0.09558 | 0.11388 |
| 40    | 300    | 50   | 77   | 178  | 0.11340 | 0.12264 |



To be clear, there are 1650 items in this MovieLens data set. Top1 and Top5 in
the table are the number of distinct items that appear in the top-1 and top-5
positions of the per-user preference lists produced by ALS. Top1, Top5 and
EPR_in are computed on the training set; only EPR_out is computed on the test
set. For Top1 and Top5, all items are taken into account, whether purchased or
not.

The table shows that a small lambda (< 1) always leads to overfitting, while a
big lambda like 300 removes the overfitting but leaves very few distinct items
in the top 1 and top 5 of the preference lists (not personalized).
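
For what it's worth, here is a rough sketch of how such a distinct-item count
can be computed (not necessarily our exact code; the name distinctTopN and the
driver-side collect of the item factors are just one convenient option when
the catalogue is small, as with the ~1650 MovieLens items):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Number of distinct items appearing in the top-n list of any user.
def distinctTopN(model: MatrixFactorizationModel, n: Int): Long = {
  // The item factors are small here, so collect and broadcast them.
  val items  = model.productFeatures.collect()
  val bItems = model.productFeatures.sparkContext.broadcast(items)
  model.userFeatures
    .flatMap { case (_, uFeat) =>
      bItems.value
        .map { case (item, iFeat) =>
          // dot product between user and item factors
          (item, uFeat.zip(iFeat).map { case (a, b) => a * b }.sum)
        }
        .sortBy(-_._2)   // highest predicted preference first
        .take(n)
        .map(_._1)
    }
    .distinct()
    .count()
}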





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-tp7067p8115.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: implicit ALS dataSet

Posted by Sean Owen <so...@cloudera.com>.
On Thu, Jun 19, 2014 at 3:44 PM, redocpot <ju...@gmail.com> wrote:
> As the paper said, the low ratings will get a low confidence weight, so if I
> understand correctly, these dominant one-timers will be more *unlikely* to
> be recommended comparing to other items whose nbPurchase is bigger.

Correct, yes.


> In fact, lambda is also considered as a potential problem, as in our case,
> the lambda is set to 300, which is confirmed by the test set. Here is test
> result :

Although people use lambda to mean different things in different
places, in every interpretation I've seen, 300 is extremely high :)  1
is very high even.

(alpha = 1 is the lowest value I'd try; it also depends on the data
but sometimes higher values work well. For the data set in the
original paper, they used alpha = 40)


> where EPR_in is given by training set and EPR_out is given by test set. It
> seems 300 is the right lambda, since less overfitting.

I take your point about your results though, hm. Can you at least try
much lower lambda? I'd have to think and speculate about why you might
be observing this effect but a few more data points could help. It may
be that you've forced the model into basically recommending globally
top items, and that does OK as a local minimum, but personalized
recommendations are better still, with a very different lambda.
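
As a sketch, I mean something along these lines (the grid values are only
placeholders, and evaluate stands in for whatever metric you already compute,
e.g. your EPR):

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Sweep lambda and rank, reporting the metric on both train and test.
def sweep(train: RDD[Rating], test: RDD[Rating],
          evaluate: (MatrixFactorizationModel, RDD[Rating]) => Double): Unit = {
  for (lambda <- Seq(0.01, 0.1, 1.0, 10.0, 50.0); rank <- Seq(20, 50, 100)) {
    val model = new ALS()
      .setImplicitPrefs(implicitPrefs = true)
      .setAlpha(1.0)
      .setLambda(lambda)
      .setRank(rank)
      .setIterations(20)
      .run(train)
    println(s"lambda=$lambda rank=$rank " +
            s"in=${evaluate(model, train)} out=${evaluate(model, test)}")
  }
}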

Also you might consider holding out the most-favored data as the test
data. It biases the test a bit, but at least you are asking whether
the model ranks highly things that are known to rank highly, rather
than any old thing the user interacted with.
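
A rough sketch of that kind of split, assuming your observations are already
Rating objects and taking the top 20% per user only as an example:

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Per user, hold out the items with the highest nbPurchase as the test set.
def splitByPreference(ratings: RDD[Rating],
                      testFraction: Double = 0.2): (RDD[Rating], RDD[Rating]) = {
  val flagged = ratings
    .groupBy(_.user)
    .flatMap { case (_, rs) =>
      val sorted = rs.toSeq.sortBy(-_.rating)                   // most-purchased first
      val nTest  = math.max(1, (sorted.size * testFraction).toInt)
      sorted.zipWithIndex.map { case (r, i) => (r, i < nTest) }  // true = test
    }
    .cache()
  (flagged.filter(!_._2).map(_._1), flagged.filter(_._2).map(_._1)) // (train, test)
}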

Re: implicit ALS dataSet

Posted by redocpot <ju...@gmail.com>.
One thing that needs to be mentioned is that, in fact, the schema is (userId,
itemId, nbPurchase), where nbPurchase plays the role of the rating. I found
that there are many one-timers, i.e. pairs whose nbPurchase = 1. These pairs
make up about 85% of all positive observations.

As the paper says, low ratings get a low confidence weight, so if I understand
correctly, these dominant one-timers will be *less likely* to be recommended
compared to other items whose nbPurchase is bigger.

In fact, lambda is also considered a potential problem: in our case lambda is
set to 300, which is confirmed by the test set. Here are the test results:

lambda = 65
EPR_in  = 0.06518592593142056
EPR_out = 0.14789338884259276

lambda = 100
EPR_in  = 0.06619274171311466
EPR_out = 0.13494609978226865

lambda = 300
EPR_in  = 0.08814703345418627
EPR_out = 0.09522125434156471

where EPR_in is computed on the training set and EPR_out on the test set. It
seems 300 is the right lambda, since there is less overfitting.

The other parameters are shown in the following code:

import org.apache.spark.mllib.recommendation.ALS

// ratings_train is an RDD[Rating] built from the (userId, itemId, nbPurchase) triples
val model = new ALS()
      .setImplicitPrefs(implicitPrefs = true)
      .setAlpha(1)
      .setLambda(300)
      .setRank(50)
      .setIterations(40)
      .setBlocks(8)
      .setSeed(42)
      .run(ratings_train)

We set alpha to 1, since the max nbPurchase is 1396. Not sure if alpha is
already too big.
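
If I read the paper correctly, the linear confidence is

c_ui = 1 + alpha * r_ui

so with alpha = 1 a one-timer gets confidence 2, while the heaviest pair
(nbPurchase = 1396) gets 1397. The paper also mentions a log-scaled variant,
c_ui = 1 + alpha * log(1 + r_ui / epsilon), which would compress exactly this
kind of skew.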

 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-tp7067p7916.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: implicit ALS dataSet

Posted by Sean Owen <so...@cloudera.com>.
On Thu, Jun 19, 2014 at 3:03 PM, redocpot <ju...@gmail.com> wrote:
> We did some sanity check. For example, each user has his own item list which
> is sorted by preference, then we just pick the top 10 items for each user.
> As a result, we found that there were only 169 different items among the
> (1060080 x 10) items picked, most of them are repeated. That means, given 2
> users, the items recommended might be the same. Nothing is personalized.

This sounds like severe underfitting -- lambda is too high or the
number of features is too small.

Re: implicit ALS dataSet

Posted by redocpot <ju...@gmail.com>.
Hi,

Recently, I launched an implicit ALS test on a real-world data set.

Initially, we have two data sets: one is the purchase record of the past 3
years (training set), and the other covers the 6 months just after those 3
years (test set).

It's a database with 1060080 users and 23880 items.

According to the paper on which MLlib's ALS is based, we use expected
percentile rank (EPR) to evaluate the recommendation performance. It shows an
EPR of about 8% - 9%, which is considered a good result in the paper.
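
In case it helps, this is roughly how we compute EPR, following the paper's
definition; it is a sketch rather than our exact code, and the variable names
are only illustrative:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// EPR = sum(r_ui * rank_ui) / sum(r_ui); lower is better, ~0.5 is random.
def epr(scores: RDD[(Int, (Int, Double))],               // (user, (item, predicted score))
        observed: RDD[((Int, Int), Double)]): Double = { // ((user, item), nbPurchase)
  val ranks = scores
    .groupByKey()
    .flatMap { case (user, items) =>
      val sorted = items.toSeq.sortBy(-_._2).map(_._1)
      val denom  = math.max(1, sorted.size - 1)
      // percentile rank: 0.0 for the best-ranked item, 1.0 for the worst
      sorted.zipWithIndex.map { case (item, idx) => ((user, item), idx.toDouble / denom) }
    }
  val (num, den) = observed.join(ranks).values            // (nbPurchase, percentile rank)
    .map { case (r, rank) => (r * rank, r) }
    .reduce { case ((n1, d1), (n2, d2)) => (n1 + n2, d1 + d2) }
  num / den
}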

We did some sanity checks. For example, each user has his own item list sorted
by preference, and we just pick the top 10 items for each user. As a result,
we found that there were only 169 different items among the (1060080 x 10)
items picked, most of them repeated. That means that, given 2 users, the items
recommended might well be the same. Nothing is personalized.

It seems that the system is focusing on best-sellers, something like that.
What we want is to recommend as many different items as possible; that makes
the recommender system more reasonable.

I am not sure if this is a common case for ALS?

Thanks,

Hao





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-tp7067p7912.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: implicit ALS dataSet

Posted by Sean Owen <so...@cloudera.com>.
On Thu, Jun 5, 2014 at 10:38 PM, redocpot <ju...@gmail.com> wrote:
> can be simplified by taking advantage of its algebraic structure, so
> negative observations are not needed. This is what I think at the first time
> I read the paper.

Correct, a big part of the reason that is efficient is the sparsity of the
input.

> What makes me confused is, after that, the paper (in Discussion section)
> says
>
> "Unlike explicit datasets, here *the model should take all user-item
> preferences as an input, including those which are not related to any input

It is not saying that these non-observations (I would not call them
negative) should explicitly appear in the input. But their implicit
existence can and should be used in the math.

In particular, the loss function being minimized also includes the error in
the implicit "0" cells of the input, just with much less weight.
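
For reference, the cost function in the paper is, roughly,

min over X, Y of  sum_{u,i} c_ui (p_ui - x_u^T y_i)^2
                  + λ (sum_u ||x_u||^2 + sum_i ||y_i||^2)

with p_ui = 1 if r_ui > 0 and 0 otherwise, and c_ui = 1 + α r_ui. So every
unobserved (u, i) cell contributes with preference 0 at the baseline
confidence of 1, while observed cells get a weight that grows with r_ui.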

Re: implicit ALS dataSet

Posted by redocpot <ju...@gmail.com>.
Thank you for your quick reply.

As far as I know, the update does not require negative observations, because
the update rule

x_u = (Y^T C^u Y + λI)^{-1} Y^T C^u p(u)

can be simplified by taking advantage of its algebraic structure, so negative
observations are not needed. This is what I thought the first time I read the
paper.
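
(The simplification I mean is, if I read it right, rewriting Y^T C^u Y as
Y^T Y + Y^T (C^u - I) Y: Y^T Y is computed once per sweep, and (C^u - I) has
non-zeros only for the n_u items user u actually interacted with, so each
user update costs about O(f^2 n_u + f^3) instead of touching all items.)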

What makes me confused is that, after that, the paper says (in the Discussion
section):

"Unlike explicit datasets, here *the model should take all user-item
preferences as an input, including those which are not related to any input
observation (thus hinting to a zero preference).* This is crucial, as the
given observations are inherently biased towards a positive preference, and
thus do not reflect well the user profile. 
However, taking all user-item values as an input to the model raises serious
scalability issues – the number of all those pairs tends to significantly
exceed the input size since a typical user would provide feedback only on a
small fraction of the available items. We address this by exploiting the
algebraic structure of the model, leading to an algorithm that scales
linearly with the input size *while addressing the full scope of user-item
pairs* without resorting to any sub-sampling."

If my understanding is right, it seems that we need the negative observations
as input, but we don't use them during the update. That seems strange to me,
because it would generate far too many user-item pairs, which is not feasible.

Thanks for the confirmation. I will read the ALS implementation for more
details.

Hao



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-tp7067p7086.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: implicit ALS dataSet

Posted by Sean Owen <so...@cloudera.com>.
The paper definitely does not suggest that you should include every
user-item pair in the input. The input is by nature extremely sparse,
so literally filling in all the 0s in the input would create
overwhelmingly large input. No, there is no need to do it and it would
be terrible for performance.

As far as I can see, the implementation would correctly handle an input of 0,
and the result would be as if it had not been included at all; but that is to
say, no, you do not include the implicit 0 input.

That's not quite negative input, either figuratively or literally. Are
you trying to figure out how to include actual negative feedback (i.e.
a signal that a user actively does not like an item)? That you do
include if you like, and the implementation is extended from the
original paper to meaningfully handle negative values.
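
In other words, passing only the observed positive pairs is enough. A minimal
sketch, where purchases stands for your (user, item, nbPurchase) triples and
the hyper-parameter values are only placeholders:

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// purchases: only the observed, positive (user, item, nbPurchase) triples.
def train(purchases: RDD[(Int, Int, Double)]): MatrixFactorizationModel = {
  val ratings = purchases.map { case (user, item, nb) => Rating(user, item, nb) }
  // No zero-preference pairs are materialized; the implicit formulation
  // accounts for the unobserved cells through the confidence weighting.
  ALS.trainImplicit(ratings, /* rank */ 50, /* iterations */ 10,
                    /* lambda */ 0.01, /* alpha */ 1.0)
}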

On Thu, Jun 5, 2014 at 4:46 PM, redocpot <ju...@gmail.com> wrote:
> Hi,
>
> According to the paper on which MLlib's ALS is based, the model should take
> all user-item preferences
> as an input, including those which are not related to any input observation
> (zero preference).
>
> My question is:
>
> With all positive observations in hand (similar to explicit feedback data
> set), should I generate all negative observations in order to make implicit
> ALS work with the complete data set (pos union neg) ?
>
> Actually, we test on some data set like:
>
> | user | item | nbPurchase |
>
> nbPurchase is non zero, so we have no negative observations. What we did is
> generating all possible user-item with zero nbPurchase to have all possible
> user-item pair, but this operation takes some time and storage.
>
> I just want to make sure whether we have to do that with MLlib's ALS ? or it
> has already done that ? In that case, I could simply pass only the positive
> observation as the explicit ALS does.
>
> Hao.
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-tp7067.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.