You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/26 09:02:56 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

zhengruifeng opened a new pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014
 
 
   ### What changes were proposed in this pull request?
   1, remove unused imports and variables
   2, remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count
   3, mark `clusterCentersWithNorm` in KMeansModel trasient and lazy, since it is only used in transformation and can be directly generated from the centers.
   
   
   ### Why are the changes needed?
   1,remove unused codes
   2,avoid repeated computation
   3,reduce broadcasted size
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   existing testsuites

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#discussion_r361407783
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 ##########
 @@ -232,7 +232,7 @@ class KMeans private (
     val zippedData = data.zip(norms).map { case ((v, w), norm) =>
       (new VectorWithNorm(v, norm), w)
     }
-    zippedData.persist()
+    zippedData.persist(StorageLevel.MEMORY_AND_DISK)
 
 Review comment:
   Keep in line with other intermediate RDDs persisted in KMeans

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#discussion_r361407427
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 ##########
 @@ -284,7 +283,6 @@ class KMeans private (
     // Execute iterations of Lloyd's algorithm until converged
     while (iteration < maxIterations && !converged) {
       val costAccum = sc.doubleAccumulator
-      val countAccum = sc.longAccumulator
 
 Review comment:
   Just found that `costAccum` also records the count

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569018356
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569018361
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20598/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569018356
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569030080
 
 
   **[Test build #115805 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115805/testReport)** for PR 27014 at commit [`cf5d705`](https://github.com/apache/spark/commit/cf5d705c4f966fbae441184b56fd8ce02bdb0ce7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569030256
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#discussion_r361476598
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala
 ##########
 @@ -42,10 +42,10 @@ class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],
   private[spark] val numIter: Int)
   extends Saveable with Serializable with PMMLExportable {
 
-  private val distanceMeasureInstance: DistanceMeasure =
+  @transient private lazy val distanceMeasureInstance: DistanceMeasure =
 
 Review comment:
   Is the reason that this not serializable? if it's just to not serialize it, it might not be worth the overhead of lazy, but won't matter much. I can understand not serializing the cluster centers twice, effectively

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569018056
 
 
   **[Test build #115805 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115805/testReport)** for PR 27014 at commit [`cf5d705`](https://github.com/apache/spark/commit/cf5d705c4f966fbae441184b56fd8ce02bdb0ce7).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569030256
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#discussion_r361407321
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 ##########
 @@ -258,7 +258,6 @@ class KMeans private (
     val distanceMeasureInstance = DistanceMeasure.decodeFromString(this.distanceMeasure)
 
     val dataVectorWithNorm = data.map(d => d._1)
-    val weights = data.map(d => d._2)
 
 Review comment:
   this var is unused

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569018056
 
 
   **[Test build #115805 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115805/testReport)** for PR 27014 at commit [`cf5d705`](https://github.com/apache/spark/commit/cf5d705c4f966fbae441184b56fd8ce02bdb0ce7).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569018361
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20598/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569030260
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115805/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27014: [SPARK-29967][ML][FOLLOWUP] KMeans Cleanup
URL: https://github.com/apache/spark/pull/27014#issuecomment-569030260
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115805/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org