You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/06/30 12:09:04 UTC

[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only

    [ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608069#comment-14608069 ] 

Apache Spark commented on SPARK-8708:
-------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/7121

> MatrixFactorizationModel.predictAll() populates single partition only
> ---------------------------------------------------------------------
>
>                 Key: SPARK-8708
>                 URL: https://issues.apache.org/jira/browse/SPARK-8708
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Antony Mayi
>
> When using mllib.recommendation.ALS the RDD returned by .predictAll() has all values pushed into single partition despite using quite high parallelism.
> This degrades performance of further processing (I can obviously run .partitionBy()) to balance it but that's still too costly (ie if running .predictAll() in loop for thousands of products) and should be possible to do it rather somehow on the model (automatically)).
> Bellow is an example on tiny sample (same on large dataset):
> {code:title=pyspark}
> >>> r1 = (1, 1, 1.0)
> >>> r2 = (1, 2, 2.0)
> >>> r3 = (2, 1, 2.0)
> >>> r4 = (2, 2, 2.0)
> >>> r5 = (3, 1, 1.0)
> >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
> >>> ratings.getNumPartitions()
> 5
> >>> users = ratings.map(itemgetter(0)).distinct()
> >>> model = ALS.trainImplicit(ratings, 1, seed=10)
> >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
> >>> predictions_for_2.glom().map(len).collect()
> [0, 0, 3, 0, 0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org