Posted to issues@spark.apache.org by "Adam Kunicki (JIRA)" <ji...@apache.org> on 2014/11/21 01:05:34 UTC

[jira] [Commented] (SPARK-3541) Improve ALS internal storage

    [ https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220267#comment-14220267 ] 

Adam Kunicki commented on SPARK-3541:
-------------------------------------

Has anyone considered that userId: Int and productId: Int may not make sense in most real-life use cases?

It requires an extra mapping from your own ids (e.g. Long, or even String) into the Int space, and a mapping back again, before you have any usable information.
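
To make the extra step concrete, here is a minimal sketch of the kind of round-trip mapping described above. It is written in plain Python and runs in a single process; `IdIndexer` and `encode_ratings` are hypothetical names made up for illustration and are not part of Spark or MLlib. A real job over a distributed dataset would need a different mechanism to assign the indices.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

INT_MAX = 2**31 - 1  # largest value that fits in a 32-bit signed Int


class IdIndexer:
    """Hypothetical helper: assigns dense Int-range indices to arbitrary ids."""

    def __init__(self) -> None:
        self._to_index: Dict[Hashable, int] = {}
        self._to_id: List[Hashable] = []

    def index_of(self, raw_id: Hashable) -> int:
        """Return the Int index for raw_id, assigning a new one on first sight."""
        idx = self._to_index.get(raw_id)
        if idx is None:
            idx = len(self._to_id)
            if idx > INT_MAX:
                raise OverflowError("more distinct ids than fit in an Int")
            self._to_index[raw_id] = idx
            self._to_id.append(raw_id)
        return idx

    def id_of(self, index: int) -> Hashable:
        """Map an Int index back to the original id."""
        return self._to_id[index]


def encode_ratings(
    ratings: Iterable[Tuple[Hashable, Hashable, float]],
    users: IdIndexer,
    products: IdIndexer,
) -> List[Tuple[int, int, float]]:
    """Replace (user id, product id) with Int indices, keeping the rating."""
    return [(users.index_of(u), products.index_of(p), r) for u, p, r in ratings]


if __name__ == "__main__":
    users, products = IdIndexer(), IdIndexer()
    raw = [("alice", 9_000_000_000, 4.0), ("bob", 9_000_000_000, 3.5)]
    encoded = encode_ratings(raw, users, products)
    # Results keyed by Int index must be translated back before they are useful.
    for u, p, r in encoded:
        print(users.id_of(u), products.id_of(p), r)
```

Both lookup tables have to be kept around (or recomputed) for as long as the results are in use, which is the overhead being questioned here.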

> Improve ALS internal storage
> ----------------------------
>
>                 Key: SPARK-3541
>                 URL: https://issues.apache.org/jira/browse/SPARK-3541
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> The internal storage of ALS uses many small objects, which increases GC pressure and makes ALS difficult to scale to very large datasets, e.g., 50 billion ratings. In such cases, a full GC may take more than 10 minutes to finish. That is longer than the default heartbeat timeout, and hence executors will be removed under default settings.
> We can use primitive arrays to reduce the number of objects significantly. This requires a big change to the ALS implementation.


