Posted to user@spark.apache.org by Roshani Nagmote <ro...@gmail.com> on 2016/09/23 18:07:30 UTC

Spark MLlib ALS algorithm

Hello,

I was working on the Spark MLlib ALS matrix factorization algorithm and came
across the following blog post:

https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html

Can anyone help me understand what the "s" scaling factor does and whether
it really gives better performance? What is its significance?
If we convert the input data to scaledData with the help of "s", will it
speed up the algorithm?

Scaled data usage:
*(For each user, we create pseudo-users that have the same ratings. That
is, for every rating (userId, productId, rating), we generate (userId+i,
productId, rating) where 0 <= i < s and s is the scaling factor.)*
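
For concreteness, here is a minimal Scala sketch of that scaling step (my
own illustration of the description above, not code from the blog post; the
pseudo-user id scheme is an assumption, chosen so pseudo-users of different
original users do not collide):

    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.rdd.RDD

    // Replicate every rating s times under pseudo-user ids, as described above.
    def scaleRatings(ratings: RDD[Rating], s: Int): RDD[Rating] =
      ratings.flatMap { r =>
        (0 until s).map(i => Rating(r.user * s + i, r.product, r.rating))
      }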

Also, this blog post is for Spark 1.1, and I am currently using 2.0.

Any help will be greatly appreciated.

Thanks,
Roshani

Re: Spark MLlib ALS algorithm

Posted by Roshani Nagmote <ro...@gmail.com>.
Hello,

I ran the ALS algorithm on 30 c4.8xlarge machines (60 GB RAM each) with the
Netflix dataset (1.4 GB; Users: 480,189, Items: 17,770, Ratings: 99M).

The *command* I ran:

/usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn  --jars
/usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar netflixals_2.11-1.0.jar
--rank 200 --numIterations 30 --lambda 5e-3 --kryo s3://netflix_train
s3://netflix_test
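
For reference, those flags correspond to the standard parameters of the
RDD-based MLlib ALS API; a minimal sketch of the equivalent calls (the
ratings RDD here is a placeholder, and --kryo amounts to setting
spark.serializer to org.apache.spark.serializer.KryoSerializer on the
SparkConf):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // Equivalent of --rank 200 --numIterations 30 --lambda 5e-3.
    def trainModel(ratings: RDD[Rating]) =
      new ALS()
        .setRank(200)
        .setIterations(30)
        .setLambda(5e-3)
        .run(ratings)
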
I get the following *error*:

Job aborted due to stage failure: Task 625 in stage 28.0 failed 4 times,
most recent failure: Lost task 625.3 in stage 28.0 (TID 9362, ip.ec2):
java.io.FileNotFoundException:/mnt/yarn/usercache/hadoop/appcache/application_1474477668615_0164/blockmgr-3d1ef0f7-9c9a-4495-8249-bea38e7dd347/06/shuffle_9_625_0.data.e9330598-330c-4622-afd9-27030c470f8a
(No space left on device)

I did set the checkpoint directory in S3 and used a checkpoint interval of 5
(a sketch of that setup follows below). The dataset is very small, so I don't
know why it runs out of space on a 30-node Spark EMR cluster.
Can anyone please help me with this?
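
For reference, a minimal sketch of the checkpoint setup mentioned above (the
S3 path is a placeholder, not the bucket actually used):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.ALS

    // The checkpoint directory is set on the SparkContext; the interval on ALS.
    def withCheckpointing(sc: SparkContext, als: ALS): ALS = {
      sc.setCheckpointDir("s3://<checkpoint-bucket>/als-checkpoints/")
      als.setCheckpointInterval(5)
    }

Note that the shuffle file named in the error lives on the executors' local
disks (under the YARN usercache / spark.local.dir), which is separate from
the checkpoint directory.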

Thanks,
Roshani


On Fri, Sep 23, 2016 at 11:50 PM, Nick Pentreath <ni...@gmail.com>
wrote:

> The scale factor was only to scale up the number of ratings in the dataset
> for performance testing purposes, to illustrate the scalability of Spark
> ALS.
>
> It is not something you would normally do on your training dataset.
>
> On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote <ro...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I was working on Spark MLlib ALS Matrix factorization algorithm and came
>> across the following blog post:
>>
>> https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
>>
>> Can anyone help me understanding what "s" scaling factor does and does it
>> really give better performance? What's the significance of this?
>> If we convert input data to scaledData with the help of "s", will it
>> speedup the algorithm?
>>
>> Scaled data usage:
>> *(For each user, we create pseudo-users that have the same ratings. That
>> is, for every rating as (userId, productId, rating), we generate (userId+i,
>> productId, rating) where 0 <= i < s and s is the scaling factor)*
>>
>> Also, this blogpost is for spark 1.1 and I am currently using 2.0
>>
>> Any help will be greatly appreciated.
>>
>> Thanks,
>> Roshani
>>
>

Re: Spark MLlib ALS algorithm

Posted by Nick Pentreath <ni...@gmail.com>.
The scale factor was only to scale up the number of ratings in the dataset
for performance testing purposes, to illustrate the scalability of Spark
ALS.

It is not something you would normally do on your training dataset.
On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote <ro...@gmail.com>
wrote:

> Hello,
>
> I was working on Spark MLlib ALS Matrix factorization algorithm and came
> across the following blog post:
>
>
> https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
>
> Can anyone help me understanding what "s" scaling factor does and does it
> really give better performance? What's the significance of this?
> If we convert input data to scaledData with the help of "s", will it
> speedup the algorithm?
>
> Scaled data usage:
> *(For each user, we create pseudo-users that have the same ratings. That
> is, for every rating as (userId, productId, rating), we generate (userId+i,
> productId, rating) where 0 <= i < s and s is the scaling factor)*
>
> Also, this blogpost is for spark 1.1 and I am currently using 2.0
>
> Any help will be greatly appreciated.
>
> Thanks,
> Roshani
>