Posted to issues@spark.apache.org by "Antony Mayi (JIRA)" <ji...@apache.org> on 2015/06/30 13:55:05 UTC

[jira] [Comment Edited] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only

    [ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608169#comment-14608169 ] 

Antony Mayi edited comment on SPARK-8708 at 6/30/15 11:55 AM:
--------------------------------------------------------------

The real case is about 13M users, a few hundred products, and about 500 partitions. The RDD returned by .predictAll() utilizes a single partition, as in my example (btw. why do you say I have one partition in my toy example? It is using 5 partitions, all of them utilized before it comes to ALS - to me it replicates the real issue I am facing).



> MatrixFactorizationModel.predictAll() populates single partition only
> ---------------------------------------------------------------------
>
>                 Key: SPARK-8708
>                 URL: https://issues.apache.org/jira/browse/SPARK-8708
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Antony Mayi
>
> When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism.
> This degrades performance of further processing. I can obviously run .partitionBy() to balance it, but that's still too costly (i.e. if running .predictAll() in a loop for thousands of products); it should rather be possible to do this somehow on the model (automatically).
> Below is an example on a tiny sample (same on a large dataset):
> {code:title=pyspark}
> >>> from operator import itemgetter
> >>> from pyspark.mllib.recommendation import ALS
> >>> r1 = (1, 1, 1.0)
> >>> r2 = (1, 2, 2.0)
> >>> r3 = (2, 1, 2.0)
> >>> r4 = (2, 2, 2.0)
> >>> r5 = (3, 1, 1.0)
> >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
> >>> ratings.getNumPartitions()
> 5
> >>> users = ratings.map(itemgetter(0)).distinct()
> >>> model = ALS.trainImplicit(ratings, 1, seed=10)
> >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
> >>> predictions_for_2.glom().map(len).collect()
> [0, 0, 3, 0, 0]
> {code}
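For readers without a Spark cluster at hand, the .partitionBy() workaround mentioned in the description can be sketched in plain Python. This is a minimal illustration, not Spark itself: the `hash_partition` helper, the partition count of 5, and the 13 synthetic user ids are all assumptions chosen to mirror the example above; it only mimics what an RDD's default hash partitioner would do with the (user, product) keys.

```python
# Pure-Python sketch (no Spark required) of hash-partitioning keyed
# predictions, mimicking RDD.partitionBy() with the default hash
# partitioner. Key values and partition count are illustrative.

def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to bucket hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Predictions for product 2, keyed by user id (as in the example above).
predictions = [(u, 2) for u in range(1, 14)]  # 13 hypothetical users
sizes = [len(p) for p in hash_partition(predictions, 5)]
print(sizes)  # keys spread across all 5 partitions instead of one
```

The point is that re-keying and hash-partitioning spreads the output evenly; the cost complained about in this issue is that the shuffle has to be paid for every .predictAll() call rather than being avoided by the model in the first place.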



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org