Posted to reviews@spark.apache.org by sujithjay <gi...@git.apache.org> on 2018/08/01 09:07:09 UTC
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
GitHub user sujithjay opened a pull request:
https://github.com/apache/spark/pull/21942
[SPARK-24283][ML] Make ml.StandardScaler skip conversion of Spark ml vectors to mllib vectors
## What changes were proposed in this pull request?
Currently, ml.StandardScaler converts Spark ml vectors to mllib vectors during transform operation. This PR makes changes to skip this step.
## How was this patch tested?
Existing tests in StandardScalerSuite
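For readers following the thread, the standardization being moved into ml can be sketched outside Spark. This is an illustrative Python version (not the actual Spark API; the function name and signature are assumptions), including the convention visible in the diff below that a column with zero std maps to 0.0:

```python
def standardize(values, mean, std, with_mean=True, with_std=True):
    """Apply (x - mean) / std per column; a zero-std column yields 0.0."""
    out = []
    for x, m, s in zip(values, mean, std):
        v = x - m if with_mean else x
        if with_std:
            # Guard against division by zero, matching the Scala diff's
            # `if (std(i) != 0.0) ... else 0.0` behaviour.
            v = v / s if s != 0.0 else 0.0
        out.append(v)
    return out
```

For example, with mean `[1.0, 5.0]` and std `[2.0, 0.0]`, the input `[3.0, 5.0]` standardizes to `[1.0, 0.0]`: the second column has zero std, so it collapses to the default 0.0.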
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sujithjay/spark SPARK-24283
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21942.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21942
----
commit c010d1bb904717412591384f31bd085e3d98f502
Author: sujithjay <su...@...>
Date: 2018-08-01T08:37:10Z
[SPARK-24283][ML][WIP] Make ml.StandardScaler skip conversion of Spark ml vectors to mllib vectors
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
OK to test
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21942
Can one of the admins verify this patch?
---
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/21942#discussion_r207103066
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,89 @@ class StandardScalerModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
- val scaler = new feature.StandardScalerModel(std, mean, $(withStd), $(withMean))
-
- // TODO: Make the transformer natively in ml framework to avoid extra conversion.
- val transformer: Vector => Vector = v => scaler.transform(OldVectors.fromML(v)).asML
+ val transformer: Vector => Vector = v => transform(v)
val scale = udf(transformer)
dataset.withColumn($(outputCol), scale(col($(inputCol))))
}
+ /**
+ * Since `shift` will be only used in `withMean` branch, we have it as
+ * `lazy val` so it will be evaluated in that branch. Note that we don't
+ * want to create this array multiple times in `transform` function.
+ */
+ private lazy val shift: Array[Double] = mean.toArray
+
+ /**
+ * Applies standardization transformation on a vector.
+ *
+ * @param vector Vector to be standardized.
+ * @return Standardized vector. If the std of a column is zero, it will return default `0.0`
+ * for the column with zero std.
+ */
+ @Since("2.3.0")
+ def transform(vector: Vector): Vector = {
--- End diff --
private[spark]?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
Hi @mengxr, @jkbradley, @MLnick and @holdenk, could you please review this PR? Thank you.
---
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21942#discussion_r216022250
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,88 @@ class StandardScalerModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
- val scaler = new feature.StandardScalerModel(std, mean, $(withStd), $(withMean))
-
- // TODO: Make the transformer natively in ml framework to avoid extra conversion.
- val transformer: Vector => Vector = v => scaler.transform(OldVectors.fromML(v)).asML
+ val transformer: Vector => Vector = v => transform(v)
val scale = udf(transformer)
dataset.withColumn($(outputCol), scale(col($(inputCol))))
}
+ /**
+ * Since `shift` will be only used in `withMean` branch, we have it as
+ * `lazy val` so it will be evaluated in that branch. Note that we don't
+ * want to create this array multiple times in `transform` function.
+ */
+ private lazy val shift: Array[Double] = mean.toArray
+
+ /**
+ * Applies standardization transformation on a vector.
+ *
+ * @param vector Vector to be standardized.
+ * @return Standardized vector. If the std of a column is zero, it will return default `0.0`
+ * for the column with zero std.
+ */
+ private[spark] def transform(vector: Vector): Vector = {
+ require(mean.size == vector.size)
+ if ($(withMean)) {
+ /**
+ * By default, Scala generates Java methods for member variables. So every time
+ * member variables are accessed, `invokespecial` is called. This is an expensive
+ * operation, and can be avoided by having a local reference of `shift`.
+ */
+ val localShift = shift
+ /** Must have a copy of the values since they will be modified in place. */
+ val values = vector match {
+ /** Handle DenseVector specially because its `toArray` method does not clone values. */
+ case d: DenseVector => d.values.clone()
+ case v: Vector => v.toArray
+ }
+ val size = values.length
+ if ($(withStd)) {
+ var i = 0
+ while (i < size) {
+ values(i) = if (std(i) != 0.0) (values(i) - localShift(i)) * (1.0 / std(i)) else 0.0
+ i += 1
+ }
+ } else {
+ var i = 0
+ while (i < size) {
+ values(i) -= localShift(i)
+ i += 1
+ }
+ }
+ Vectors.dense(values)
+ } else if ($(withStd)) {
--- End diff --
Maybe leave a comment that this branch is `withStd` and not `withMean`, since the nested if/else-if flow can get a bit confusing when tracing the code by hand.
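The control flow in question can be sketched as follows (an illustrative Python mirror of the Scala branches above, not Spark code; names are assumptions):

```python
def transform(values, shift, std, with_mean, with_std):
    # Branch 1: with_mean (and optionally with_std): shift, then maybe scale.
    if with_mean:
        vals = [x - m for x, m in zip(values, shift)]
        if with_std:
            vals = [v / s if s != 0.0 else 0.0 for v, s in zip(vals, std)]
        return vals
    # Branch 2: with_std and NOT with_mean: scale only.
    elif with_std:
        return [x / s if s != 0.0 else 0.0 for x, s in zip(values, std)]
    # Branch 3: neither flag set: identity.
    else:
        return values
```

Labeling the three branches this way (mean-then-maybe-std, std-only, identity) is the kind of comment the review asks for.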
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21942
Can one of the admins verify this patch?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
@holdenk Could you also please take a look at this PR?
---
[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21942#discussion_r216021077
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,88 @@ class StandardScalerModel private[ml] (
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
- val scaler = new feature.StandardScalerModel(std, mean, $(withStd), $(withMean))
-
- // TODO: Make the transformer natively in ml framework to avoid extra conversion.
- val transformer: Vector => Vector = v => scaler.transform(OldVectors.fromML(v)).asML
+ val transformer: Vector => Vector = v => transform(v)
val scale = udf(transformer)
dataset.withColumn($(outputCol), scale(col($(inputCol))))
}
+ /**
+ * Since `shift` will be only used in `withMean` branch, we have it as
+ * `lazy val` so it will be evaluated in that branch. Note that we don't
+ * want to create this array multiple times in `transform` function.
+ */
+ private lazy val shift: Array[Double] = mean.toArray
--- End diff --
How does this interplay with serialization? Would it make sense to evaluate this before we serialize the UDF, so it isn't re-done on each worker?
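The eager-capture idea can be made concrete with a hypothetical Python analogue (not Spark code; `Model` and `make_udf` are illustrative names): binding the lazily derived array to a local before building the closure means the closure carries only that small array, and the derivation runs once where the closure is built rather than on each deserialized copy.

```python
class Model:
    def __init__(self, mean):
        self.mean = mean
        self._shift = None  # lazily derived from mean

    @property
    def shift(self):
        # Stand-in for the expensive lazy derivation (mean.toArray in Scala).
        if self._shift is None:
            self._shift = list(self.mean)
        return self._shift

    def make_udf(self):
        # Capture the derived array eagerly in a local: the returned closure
        # references only local_shift, not self, so serializing it would ship
        # just the array and the derivation happens once, here.
        local_shift = self.shift
        return lambda v: [x - m for x, m in zip(v, local_shift)]
```

In the PR as written, the closure captures the whole model, so the lazy val is evaluated once per deserialized model instance instead.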
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by holdensmagicalunicorn <gi...@git.apache.org>.
Github user holdensmagicalunicorn commented on the issue:
https://github.com/apache/spark/pull/21942
@sujithjay, thanks! I am a bot who has found some folks who might be able to help with the review: @mengxr, @jkbradley and @MLnick
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21942
Sorry I've been slow to respond, I'll take a look today.
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/21942
I think it's better to move the code and unit tests in one PR. But since it's not a trivial change, I suggest you wait for the committers' opinion.
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
@hhbyyh Personally, I like the idea of moving the unit tests to ML. However, are you suggesting I move some of them as part of this PR, or is that a task we should take up separately?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by sujithjay <gi...@git.apache.org>.
Github user sujithjay commented on the issue:
https://github.com/apache/spark/pull/21942
Hi @mengxr, @jkbradley, @MLnick, @holdenk, @viirya, could you please review this PR?
---
[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21942
Can one of the admins verify this patch?
---