Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2014/11/18 01:22:33 UTC

[jira] [Updated] (SPARK-4431) Implement efficient activeIterator for dense and sparse vector

     [ https://issues.apache.org/jira/browse/SPARK-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-4431:
---------------------------------
    Target Version/s: 1.2.0

> Implement efficient activeIterator for dense and sparse vector
> --------------------------------------------------------------
>
>                 Key: SPARK-4431
>                 URL: https://issues.apache.org/jira/browse/SPARK-4431
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>
> Previously, we used Breeze's activeIterator to access the non-zero elements of a sparse vector, and explicitly skipped the zeros in dense and sparse vectors using pattern matching. Due to the overhead, we switched back to a native `while` loop in #SPARK-4129.
> However, the #SPARK-4129 implementation dereferences dv.values/sv.values on every access to a value, and any zeros in a dense or sparse vector are skipped only inside the add function call. The overall penalty is around 10% compared with dereferencing once outside the while block and checking for zero before calling the add function. The code is also branched for dense and sparse vectors, which makes it hard to maintain in the long term.
> This activeIterator implementation not only improves performance; abstracting access to the non-zero elements across vector types also makes the codebase easier to maintain. In this PR, only MultivariateOnlineSummarizer uses the new API, as an example; other code can be migrated to activeIterator later.
> Benchmark: mnist8m dataset on a single JVM, with the first 200 samples loaded in memory, repeated 5000 times.
> Before change: 
> Sparse Vector - 30.02
> Dense Vector - 38.27
> After this optimization:
> Sparse Vector - 27.54
> Dense Vector - 35.13
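
The idea in the description (dereference the backing arrays once, skip zeros before the caller sees them, and hide the dense/sparse split behind one method) can be sketched in Scala roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `Vec`, `DenseVec`, `SparseVec`, and `nnzAndSum` are hypothetical names and do not correspond to Spark's or Breeze's actual classes or signatures.

```scala
// Hypothetical minimal vector types for illustration only.
// These are NOT Spark's mllib.linalg classes.
sealed trait Vec {
  def size: Int
  /** Iterates over (index, value) pairs, skipping entries whose value is 0.0. */
  def activeIterator: Iterator[(Int, Double)]
}

final class DenseVec(val values: Array[Double]) extends Vec {
  def size: Int = values.length

  def activeIterator: Iterator[(Int, Double)] = new Iterator[(Int, Double)] {
    private val vs = values // dereference the array once, outside the loop
    private var i = 0
    // Advance i to the next non-zero entry (or to the end).
    private def skipZeros(): Unit = while (i < vs.length && vs(i) == 0.0) i += 1
    skipZeros()

    def hasNext: Boolean = i < vs.length
    def next(): (Int, Double) = {
      if (!hasNext) throw new NoSuchElementException("next on empty iterator")
      val result = (i, vs(i))
      i += 1
      skipZeros()
      result
    }
  }
}

final class SparseVec(val size: Int, val indices: Array[Int], val values: Array[Double])
    extends Vec {
  require(indices.length == values.length, "indices and values must have the same length")

  def activeIterator: Iterator[(Int, Double)] = new Iterator[(Int, Double)] {
    private val is = indices // dereference both arrays once
    private val vs = values
    private var k = 0
    // A sparse vector may still store explicit zeros; skip those as well.
    private def skipZeros(): Unit = while (k < vs.length && vs(k) == 0.0) k += 1
    skipZeros()

    def hasNext: Boolean = k < vs.length
    def next(): (Int, Double) = {
      if (!hasNext) throw new NoSuchElementException("next on empty iterator")
      val result = (is(k), vs(k))
      k += 1
      skipZeros()
      result
    }
  }
}

object ActiveIteratorSketch {
  /** Hypothetical consumer: counts non-zeros and sums them without branching on vector type. */
  def nnzAndSum(v: Vec): (Int, Double) = {
    var nnz = 0
    var sum = 0.0
    val it = v.activeIterator
    while (it.hasNext) {
      val (_, x) = it.next()
      nnz += 1
      sum += x
    }
    (nnz, sum)
  }
}
```

The consumer no longer pattern-matches on the vector type, which is the maintainability point made above. Note that this sketch allocates a tuple per element; a performance-sensitive version would likely avoid that, for example by passing a callback instead of returning pairs.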


