Posted to reviews@spark.apache.org by sujithjay <gi...@git.apache.org> on 2018/08/01 09:07:09 UTC
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
GitHub user sujithjay opened a pull request:
https://github.com/apache/spark/pull/21942
[SPARK-24283][ML] Make ml.StandardScaler skip conversion of Spark ml vectors to mllib vectors
## What changes were proposed in this pull request?
Currently, ml.StandardScaler converts Spark ml vectors to mllib vectors during transform operation. This PR makes changes to skip this step.
## How was this patch tested?
Existing tests in StandardScalerSuite
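For readers following the thread, the standardization being moved into ml can be sketched outside Spark. This is an illustrative Python version (not the actual Spark API; the function name and signature are assumptions), including the convention visible in the diff below that a column with zero std maps to 0.0:

```python
def standardize(values, mean, std, with_mean=True, with_std=True):
    """Apply (x - mean) / std per column; a zero-std column yields 0.0."""
    out = []
    for x, m, s in zip(values, mean, std):
        v = x - m if with_mean else x
        if with_std:
            # Guard against division by zero, matching the Scala diff's
            # `if (std(i) != 0.0) ... else 0.0` behaviour.
            v = v / s if s != 0.0 else 0.0
        out.append(v)
    return out
```

For example, with mean `[1.0, 5.0]` and std `[2.0, 0.0]`, the input `[3.0, 5.0]` standardizes to `[1.0, 0.0]`: the second column has zero std, so it collapses to the default 0.0.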
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sujithjay/spark SPARK-24283
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21942.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21942
----
commit c010d1bb904717412591384f31bd085e3d98f502
Author: sujithjay <su...@...>
Date: 2018-08-01T08:37:10Z
[SPARK-24283][ML][WIP] Make ml.StandardScaler skip conversion of Spark ml vectors to mllib vectors
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
OK to test
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21942
Can one of the admins verify this patch?
---
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/21942#discussion_r207103066
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,89 @@ class StandardScalerModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
- val scaler = new feature.StandardScalerModel(std, mean, $(withStd), $(withMean))
-
- // TODO: Make the transformer natively in ml framework to avoid extra conversion.
- val transformer: Vector => Vector = v => scaler.transform(OldVectors.fromML(v)).asML
+ val transformer: Vector => Vector = v => transform(v)
val scale = udf(transformer)
dataset.withColumn($(outputCol), scale(col($(inputCol))))
}
+ /**
+ * Since `shift` will be only used in `withMean` branch, we have it as
+ * `lazy val` so it will be evaluated in that branch. Note that we don't
+ * want to create this array multiple times in `transform` function.
+ */
+ private lazy val shift: Array[Double] = mean.toArray
+
+ /**
+ * Applies standardization transformation on a vector.
+ *
+ * @param vector Vector to be standardized.
+ * @return Standardized vector. If the std of a column is zero, it will return default `0.0`
+ * for the column with zero std.
+ */
+ @Since("2.3.0")
+ def transform(vector: Vector): Vector = {
--- End diff --
private[spark]?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
Hi @mengxr, @jkbradley, @MLnick and @holdenk, could you please review this PR? Thank you.
---
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21942#discussion_r216022250
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,88 @@ class StandardScalerModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
- val scaler = new feature.StandardScalerModel(std, mean, $(withStd), $(withMean))
-
- // TODO: Make the transformer natively in ml framework to avoid extra conversion.
- val transformer: Vector => Vector = v => scaler.transform(OldVectors.fromML(v)).asML
+ val transformer: Vector => Vector = v => transform(v)
val scale = udf(transformer)
dataset.withColumn($(outputCol), scale(col($(inputCol))))
}
+ /**
+ * Since `shift` will be only used in `withMean` branch, we have it as
+ * `lazy val` so it will be evaluated in that branch. Note that we don't
+ * want to create this array multiple times in `transform` function.
+ */
+ private lazy val shift: Array[Double] = mean.toArray
+
+ /**
+ * Applies standardization transformation on a vector.
+ *
+ * @param vector Vector to be standardized.
+ * @return Standardized vector. If the std of a column is zero, it will return default `0.0`
+ * for the column with zero std.
+ */
+ private[spark] def transform(vector: Vector): Vector = {
+ require(mean.size == vector.size)
+ if ($(withMean)) {
+ /**
+ * By default, Scala generates Java methods for member variables. So every time
+ * member variables are accessed, `invokespecial` is called. This is an expensive
+ * operation, and can be avoided by having a local reference of `shift`.
+ */
+ val localShift = shift
+ /** Must have a copy of the values since they will be modified in place. */
+ val values = vector match {
+ /** Handle DenseVector specially because its `toArray` method does not clone values. */
+ case d: DenseVector => d.values.clone()
+ case v: Vector => v.toArray
+ }
+ val size = values.length
+ if ($(withStd)) {
+ var i = 0
+ while (i < size) {
+ values(i) = if (std(i) != 0.0) (values(i) - localShift(i)) * (1.0 / std(i)) else 0.0
+ i += 1
+ }
+ } else {
+ var i = 0
+ while (i < size) {
+ values(i) -= localShift(i)
+ i += 1
+ }
+ }
+ Vectors.dense(values)
+ } else if ($(withStd)) {
--- End diff --
Maybe leave a comment that this branch is `withStd` and not `withMean`, since the nested if/else-if flow can get a bit confusing when tracing the code by hand.
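The control flow in question can be sketched as follows (an illustrative Python mirror of the Scala branches above, not Spark code; names are assumptions):

```python
def transform(values, shift, std, with_mean, with_std):
    # Branch 1: with_mean (and optionally with_std): shift, then maybe scale.
    if with_mean:
        vals = [x - m for x, m in zip(values, shift)]
        if with_std:
            vals = [v / s if s != 0.0 else 0.0 for v, s in zip(vals, std)]
        return vals
    # Branch 2: with_std and NOT with_mean: scale only.
    elif with_std:
        return [x / s if s != 0.0 else 0.0 for x, s in zip(values, std)]
    # Branch 3: neither flag set: identity.
    else:
        return values
```

Labeling the three branches this way (mean-then-maybe-std, std-only, identity) is the kind of comment the review asks for.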
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21942
Can one of the admins verify this patch?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
@holdenk Could you also please take a look at this PR?
---
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21942#discussion_r216021077
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,88 @@ class StandardScalerModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
- val scaler = new feature.StandardScalerModel(std, mean, $(withStd), $(withMean))
-
- // TODO: Make the transformer natively in ml framework to avoid extra conversion.
- val transformer: Vector => Vector = v => scaler.transform(OldVectors.fromML(v)).asML
+ val transformer: Vector => Vector = v => transform(v)
val scale = udf(transformer)
dataset.withColumn($(outputCol), scale(col($(inputCol))))
}
+ /**
+ * Since `shift` will be only used in `withMean` branch, we have it as
+ * `lazy val` so it will be evaluated in that branch. Note that we don't
+ * want to create this array multiple times in `transform` function.
+ */
+ private lazy val shift: Array[Double] = mean.toArray
--- End diff --
How does this interplay with serialization? Would it make sense to evaluate this before we serialize the UDF, so it isn't re-done on each worker?
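The eager-capture idea can be made concrete with a hypothetical Python analogue (not Spark code; `Model` and `make_udf` are illustrative names): binding the lazily derived array to a local before building the closure means the closure carries only that small array, and the derivation runs once where the closure is built rather than on each deserialized copy.

```python
class Model:
    def __init__(self, mean):
        self.mean = mean
        self._shift = None  # lazily derived from mean

    @property
    def shift(self):
        # Stand-in for the expensive lazy derivation (mean.toArray in Scala).
        if self._shift is None:
            self._shift = list(self.mean)
        return self._shift

    def make_udf(self):
        # Capture the derived array eagerly in a local: the returned closure
        # references only local_shift, not self, so serializing it would ship
        # just the array and the derivation happens once, here.
        local_shift = self.shift
        return lambda v: [x - m for x, m in zip(v, local_shift)]
```

In the PR as written, the closure captures the whole model, so the lazy val is evaluated once per deserialized model instance instead.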
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by holdensmagicalunicorn <gi...@git.apache.org>.
Github user holdensmagicalunicorn commented on the issue:
https://github.com/apache/spark/pull/21942
@sujithjay, thanks! I am a bot who has found some folks who might be able to help with the review: @mengxr, @jkbradley and @MLnick
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21942
Sorry I've been slow to respond, I'll take a look today.
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/21942
I think it's better to move the code and unit tests in one PR. But since it's not a trivial change, I suggest you wait for the committers' opinion.
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
@hhbyyh Personally, I like the idea of moving the unit tests to ML. However, are you suggesting I move some of them as part of this PR, or is that a task we should take up separately?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
Hi @mengxr, @jkbradley, @MLnick, @holdenk, @viirya, could you please review this PR?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21942
Can one of the admins verify this patch?
---