Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/08/11 10:07:20 UTC

[jira] [Updated] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

     [ https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17001:
------------------------------
             Shepherd:   (was: Tobi Bosede)
                Flags:   (was: Important)
    Affects Version/s:     (was: 1.6.1)
                           (was: 1.5.1)
                           (was: 1.6.0)
                           (was: 1.4.1)
                           (was: 1.5.0)
                           (was: 1.4.0)
               Labels:   (was: performance)
             Priority: Minor  (was: Major)
          Description: 
When withMean = true, StandardScaler does not handle sparse vectors and instead throws an exception. This is presumably because subtracting the mean makes a sparse vector dense, which can be undesirable.

However, VectorAssembler generates vectors that may be a mix of sparse and dense, even when the vectors are smallish, depending on their values. It is common to feed this output into StandardScaler, so with withMean = true the pipeline fails or succeeds depending on the input. This is surprising.

StandardScaler should go ahead and operate on sparse vectors and subtract the mean when explicitly asked to do so with withMean, on the theory that the user knows what he or she is doing, and that there is otherwise no way to make this work.
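The densification issue can be illustrated with a small plain-Python sketch (illustration only, not Spark's actual implementation):

```python
# Illustration: why subtracting the mean densifies a sparse vector.
# A sparse vector stores only its nonzero (index, value) pairs.
sparse = {1: 3.0, 3: 7.0}   # conceptually [0, 3, 0, 7]
size = 4

mean = sum(sparse.values()) / size   # 2.5

# Subtracting the mean shifts every slot, including the implicit zeros,
# so nearly all entries become nonzero and the sparsity is lost.
centered = [sparse.get(i, 0.0) - mean for i in range(size)]
print(centered)   # [-2.5, 0.5, -2.5, 4.5]
```

With many implicit zeros, the centered result has almost no zero entries left, which is why a naive implementation would have to materialize a dense vector.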

  was:
standardScaler does not allow the mean to be subtracted from sparse vectors; it will only divide by the standard deviation, to keep the vector sparse. withMean=True should be the default behavior and should apply an *offset if the vector is sparse, whereas there would be normal subtraction if the vector is dense. This way the default behavior of standardScaler will always be what is generally understood as standardization, as opposed to people thinking they are standardizing when they are not. To allow the data to still fit in memory, we want to avoid simply converting the sparse vector to a dense one.
*What is meant by "offset":
Imagine a sparse vector 1:3 3:7, which conceptually represents 0 3 0 7. Imagine it also stores an offset that applies to all elements. If the offset is -2, the vector now represents -2 1 -2 5, but this requires just one extra value to store. The offset only helps with storage of a shifted sparse vector; iterating still typically requires visiting all elements.
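The offset idea above can be sketched as a small hypothetical class (plain Python, not Spark's SparseVector; the name OffsetSparseVector and its methods are made up for illustration):

```python
# Hypothetical sketch: a sparse vector carrying a single scalar offset
# that applies to every logical element, so centering the vector needs
# only one extra stored value instead of densification.
class OffsetSparseVector:
    def __init__(self, size, entries, offset=0.0):
        self.size = size                # logical length of the vector
        self.entries = dict(entries)    # index -> stored (unshifted) value
        self.offset = offset            # added to every logical element

    def __getitem__(self, i):
        # Logical value at index i: stored value (or 0) plus the offset.
        return self.entries.get(i, 0.0) + self.offset

    def subtract_mean(self):
        # Mean of the logical (offset-shifted) vector.
        mean = (sum(self.entries.values()) + self.offset * self.size) / self.size
        # Centering only adjusts the offset; the stored entries are untouched.
        return OffsetSparseVector(self.size, self.entries, self.offset - mean)

    def to_dense(self):
        return [self[i] for i in range(self.size)]

# The example from the description: 1:3 3:7 is conceptually [0, 3, 0, 7];
# with offset -2 it represents [-2, 1, -2, 5] at the cost of one extra value.
v = OffsetSparseVector(4, {1: 3.0, 3: 7.0}, offset=-2.0)
print(v.to_dense())                  # [-2.0, 1.0, -2.0, 5.0]
```

As noted, this only helps with storage: any operation that must see the shifted values, such as iteration, still touches all size elements.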



[~anitobib@gmail.com] please read through https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ; I'm going to edit this per discussion on the mailing list.

> Enable standardScaler to standardize sparse vectors when withMean=True
> ----------------------------------------------------------------------
>
>                 Key: SPARK-17001
>                 URL: https://issues.apache.org/jira/browse/SPARK-17001
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Tobi Bosede
>            Priority: Minor
>
> When withMean = true, StandardScaler does not handle sparse vectors and instead throws an exception. This is presumably because subtracting the mean makes a sparse vector dense, which can be undesirable.
> However, VectorAssembler generates vectors that may be a mix of sparse and dense, even when the vectors are smallish, depending on their values. It is common to feed this output into StandardScaler, so with withMean = true the pipeline fails or succeeds depending on the input. This is surprising.
> StandardScaler should go ahead and operate on sparse vectors and subtract the mean when explicitly asked to do so with withMean, on the theory that the user knows what he or she is doing, and that there is otherwise no way to make this work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org