Posted to issues@spark.apache.org by "Eyal Allweil (JIRA)" <ji...@apache.org> on 2017/01/16 14:57:26 UTC

[jira] [Commented] (SPARK-18781) Allow MatrixFactorizationModel.predict to skip user/product approximation count

    [ https://issues.apache.org/jira/browse/SPARK-18781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824111#comment-15824111 ] 

Eyal Allweil commented on SPARK-18781:
--------------------------------------

It seems the approximation count takes 10-20% of the total running time. When I opened this issue my jobs were taking about an hour, so it was more noticeable. The jobs I've been running lately take 10-20 minutes, so it "feels" less important because it's only a few minutes, but it's always at least 10% of the run, and usually around 15%.

> Allow MatrixFactorizationModel.predict to skip user/product approximation count
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18781
>                 URL: https://issues.apache.org/jira/browse/SPARK-18781
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Eyal Allweil
>            Priority: Minor
>
> When [MatrixFactorizationModel.predict|https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.html#predict(org.apache.spark.rdd.RDD)] is called, it first computes an approximate count of the users and products in order to determine the most efficient way to proceed. In many cases the answer is known in advance (typically there are more users than products by an order of magnitude), so the check is unnecessary. It would be nice to add a parameter to this predict method that lets the caller choose the implementation and skip the check.
> This would be especially useful in development cycles, when you are repeatedly tweaking your model and the pairs you are predicting for, and the approximate count makes up a meaningful portion of the time you wait for results.
> I can provide a pull request with this ability added that preserves the existing behavior.
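
To make the proposal concrete, here is a minimal sketch in plain Python. It is not Spark code and not the actual patch. The function name choose_join_side and the hint parameter are hypothetical, the exact set-based count stands in for Spark's approximate count, and the rule "start from the side with fewer ids" is an assumption made for illustration. The only point is that a caller who already knows the answer can skip the counting pass, while the default behavior stays the same.

```python
from typing import Optional, Sequence, Tuple

Pair = Tuple[int, int]  # (user id, product id)


def choose_join_side(pairs: Sequence[Pair],
                     hint: Optional[str] = None) -> Tuple[str, bool]:
    """Return (side, counted), where side is "users" or "products".

    hint is a hypothetical parameter standing in for the one proposed in
    this issue. When it is given, the counting pass is skipped.
    """
    if hint is not None:
        if hint not in ("users", "products"):
            raise ValueError("hint must be 'users' or 'products'")
        return hint, False  # caller already knows the answer: no count

    # Exact distinct counts; a stand-in for Spark's approximate count.
    n_users = len({user for user, _ in pairs})
    n_products = len({product for _, product in pairs})

    # Assumed rule for this sketch: start from the side with fewer ids.
    side = "users" if n_users < n_products else "products"
    return side, True


pairs = [(1, 10), (2, 10), (3, 11)]        # 3 users, 2 products
print(choose_join_side(pairs))              # counting pass runs
print(choose_join_side(pairs, "products"))  # counting pass skipped
```

Leaving hint unset keeps the existing behavior, which is how a real change could stay backward compatible.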


