Posted to user@spark.apache.org by lisendong <li...@163.com> on 2015/03/30 16:27:20 UTC

Re: different result from implicit ALS with explicit ALS

hi, xiangrui:
I found that the ALS in Spark 1.3.0 forgets to call checkpoint() in the explicit ALS path:
the code is:
https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala


Checkpointing is very important in my situation, because my task produces 1 TB of shuffle data in each iteration. If the shuffle data is not deleted after each iteration (by using checkpoint()), the task will produce 30 TB of data…


So I changed the ALS code and re-compiled it myself, but the checkpoint does not seem to take effect, and the task still occupies 30 TB of disk… (I only added two lines to ALS.scala):
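(A hypothetical sketch of the kind of two-line change I mean, not the actual patch; the placement inside the training loop and the checkpointInterval name are my own illustration, and it assumes sc.setCheckpointDir(...) has been set for the SparkContext:)

    // Sketch only, inside the iteration loop of ml/recommendation/ALS.scala:
    if (iter % checkpointInterval == 0) {   // checkpointInterval: illustrative constant
      itemFactors.checkpoint()              // mark the factor RDD for checkpointing
      itemFactors.count()                   // force an action so the checkpoint is written
    }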





Also, the driver’s log seems strange; why are the log lines printed all together...


thank you very much!


> On Feb 26, 2015, at 11:33 PM, 163 <li...@163.com> wrote:
> 
> Thank you very much for your opinion:)
> 
> In our case, it may be dangerous to treat unobserved items as negative interactions (although we could give them a small confidence, I think they are still not credible...)
> 
> I will do more experiments and give you feedback:)
> 
> Thank you;)
> 
> 
>> On Feb 26, 2015, at 23:16, Sean Owen <so...@cloudera.com> wrote:
>> 
>> I believe that's right, and is what I was getting at. Yes, the implicit
>> formulation ends up implicitly including every possible interaction in
>> its loss function, even unobserved ones. That could be the difference.
>> 
>> This is mostly an academic question though. In practice, you have
>> click-like data and should be using the implicit version for sure.
>> 
>> However you can give negative implicit feedback to the model. You
>> could consider no-click as a mild, observed, negative interaction.
>> That is: supply a small negative value for these cases. Unobserved
>> pairs are not part of the data set. I'd be careful about assuming the
>> lack of an action carries signal.
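
(A minimal sketch of what this suggestion could look like in code; the -0.1 value, the sc context, and the displayLog RDD of (user, item, clicked) triples are illustrative assumptions, not from the thread:)

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Clicks become positive feedback, displayed-but-not-clicked becomes a small
    // negative value, and never-displayed pairs are simply left out of the data set.
    val displayLog = sc.parallelize(Seq((1, 10, true), (1, 11, false), (2, 10, false)))
    val ratings = displayLog.map { case (user, item, clicked) =>
      if (clicked) Rating(user, item, 1.0) else Rating(user, item, -0.1)
    }
    // trainImplicit(ratings, rank, iterations, lambda, blocks, alpha, seed)
    val model = ALS.trainImplicit(ratings, 30, 30, 0.01, -1, 1.0, 1L)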
>> 
>>> On Thu, Feb 26, 2015 at 3:07 PM, 163 <li...@163.com> wrote:
>>> oh my god, I think I understood...
>>> In my case, there are three kinds of user-item pairs:
>>> 
>>> Display and click pair(positive pair)
>>> Display but no-click pair(negative pair)
>>> No-display pair(unobserved pair)
>>> 
>>> Explicit ALS only considers the first and the second kinds,
>>> but implicit ALS considers all three kinds of pairs (and treats the third
>>> kind like the second, because their preference values are all zero and
>>> their confidences are all 1).
>>> 
>>> So the results are different, right?
>>> 
>>> Could you please give me some advice on which ALS I should use?
>>> If I use implicit ALS, how do I distinguish the second and the third kinds
>>> of pairs? :)
>>> 
>>> My opinion is that, in my case, I should use explicit ALS ...
>>> 
>>> Thank you so much
>>> 
>>> On Feb 26, 2015, at 22:41, Xiangrui Meng <me...@databricks.com> wrote:
>>> 
>>> Lisen, did you use all m-by-n pairs during training? Implicit model
>>> penalizes unobserved ratings, while explicit model doesn't. -Xiangrui
>>> 
>>>> On Feb 26, 2015 6:26 AM, "Sean Owen" <so...@cloudera.com> wrote:
>>>> 
>>>> +user
>>>> 
>>>>> On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>> 
>>>>> I think I may have it backwards, and that you are correct to keep the 0
>>>>> elements in train() in order to try to reproduce the same result.
>>>>> 
>>>>> The second formulation is called 'weighted regularization' and is used
>>>>> for both implicit and explicit feedback, as far as I can see in the code.
>>>>> 
>>>>> Hm, I'm actually not clear why these would produce different results.
>>>>> Different code paths are used to be sure, but I'm not yet sure why they
>>>>> would give different results.
>>>>> 
>>>>> In general you wouldn't use train() for data like this though, and would
>>>>> never set alpha=0.
>>>>> 
>>>>>> On Thu, Feb 26, 2015 at 2:15 PM, lisendong <li...@163.com> wrote:
>>>>>> 
>>>>>> I want to confirm the loss function you use (sorry, I’m not so familiar
>>>>>> with Scala code, so I did not understand the MLlib source code)
>>>>>> 
>>>>>> According to the papers:
>>>>>> 
>>>>>> 
>>>>>> in your implicit feedback ALS, the loss function is (ICDM 2008):
>>>>>> 
>>>>>> in the explicit feedback ALS, the loss function is (Netflix 2008):
>>>>>> 
>>>>>> Note that besides the difference in the confidence parameter c_ui, the
>>>>>> regularization is also different. Does your code also have this difference?
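
(The two loss functions referred to above were inline images that are not in this plain-text version; in the standard notation of the two cited papers they read roughly as follows.)

    Implicit feedback (Hu, Koren, Volinsky, ICDM 2008), summed over all m-by-n pairs,
    with confidence c_ui = 1 + alpha * r_ui and preference p_ui = [r_ui > 0]:

        \min_{x,y} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2
                   + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)

    Explicit feedback with weighted-lambda regularization (Zhou et al., Netflix/AAIM 2008),
    summed only over the observed ratings K, where n_u and n_i count the ratings of
    user u and of item i:

        \min_{x,y} \sum_{(u,i) \in K} (r_{ui} - x_u^T y_i)^2
                   + \lambda \left( \sum_u n_u \|x_u\|^2 + \sum_i n_i \|y_i\|^2 \right)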
>>>>>> 
>>>>>> Best Regards,
>>>>>> Sendong Li
>>>>>> 
>>>>>> 
>>>>>>> On Feb 26, 2015, at 9:42 PM, lisendong <li...@163.com> wrote:
>>>>>>> 
>>>>>>> Hi meng, fotero, sowen:
>>>>>>> 
>>>>>>> I’m using ALS with Spark 1.0.0; the code should be:
>>>>>>> 
>>>>>>> https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
>>>>>>> 
>>>>>>> I think the following two methods should produce the same (or nearly the
>>>>>>> same) result:
>>>>>>> 
>>>>>>> MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01,
>>>>>>> -1, 1);
>>>>>>> 
>>>>>>> MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30,
>>>>>>> 30, 0.01, -1, 0, 1);
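
(Restated in Scala with the positional arguments labeled; this is my reading of the Spark 1.0 API and should be checked against the actual signatures:)

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    def compare(ratings: RDD[Rating]): (MatrixFactorizationModel, MatrixFactorizationModel) = {
      // train(ratings, rank, iterations, lambda, blocks, seed)
      val explicitModel = ALS.train(ratings, 30, 30, 0.01, -1, 1L)
      // trainImplicit(ratings, rank, iterations, lambda, blocks, alpha, seed);
      // alpha = 0 makes every confidence c_ui = 1 + alpha * r_ui equal to 1
      val implicitModel = ALS.trainImplicit(ratings, 30, 30, 0.01, -1, 0.0, 1L)
      (explicitModel, implicitModel)
    }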
>>>>>>> 
>>>>>>> The data I used is a display log; the format of the log is as follows:
>>>>>>> 
>>>>>>> user  item  if-click
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I use 1.0 as the score for click pairs, and 0 as the score for non-click pairs.
>>>>>>> 
>>>>>>> In the second method, alpha is set to zero, so the confidence for
>>>>>>> positive and negative pairs is 1.0 in both cases (right?)
>>>>>>> 
>>>>>>> I think the two methods should produce similar results, but the second
>>>>>>> method’s result is very bad (the AUC of the first result is 0.7, while the
>>>>>>> AUC of the second result is only 0.61).
>>>>>>> 
>>>>>>> 
>>>>>>> I cannot understand why; could you help me?
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you very much!
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> Sendong Li
>>>>>> 
>>>>>> 
>>>>> 
>>>> 


Re: different result from implicit ALS with explicit ALS

Posted by lisendong <li...@163.com>.
I have updated my Spark source code to 1.3.1.

The checkpoint works well. 

BUT the shuffle data still cannot be deleted automatically… the disk usage is still 30 TB…

I have set spark.cleaner.referenceTracking.blocking.shuffle to true.
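
(A minimal sketch of that setup; the application name and checkpoint directory path are only placeholders:)

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable blocking cleanup of shuffle files and give the context a checkpoint
    // directory so ALS checkpointing can truncate the lineage.
    val conf = new SparkConf()
      .setAppName("als-with-checkpoint")                               // illustrative name
      .set("spark.cleaner.referenceTracking.blocking.shuffle", "true")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")                 // placeholder path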

Do you know how to solve my problem?

Sendong Li



> On Mar 31, 2015, at 12:11 AM, Xiangrui Meng <me...@gmail.com> wrote:
> 
> setCheckpointInterval was added in the current master and branch-1.3. Please help check whether it works. It will be included in the 1.3.1 and 1.4.0 release. -Xiangrui
> 

Re: different result from implicit ALS with explicit ALS

Posted by Xiangrui Meng <me...@gmail.com>.
setCheckpointInterval was added in the current master and branch-1.3.
Please help check whether it works. It will be included in the 1.3.1 and
1.4.0 release. -Xiangrui
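
(A minimal sketch of how the new setter might be used, assuming it is exposed on the MLlib ALS builder in 1.3.1+ and that sc and ratings exist as in the earlier messages; the checkpoint directory is a placeholder:)

    import org.apache.spark.mllib.recommendation.ALS

    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // required for checkpointing; placeholder path
    val als = new ALS()
      .setRank(30)
      .setIterations(30)
      .setLambda(0.01)
      .setCheckpointInterval(5)   // checkpoint the factor RDDs every 5 iterations
    val model = als.run(ratings)  // ratings: RDD[Rating]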
