Posted to reviews@spark.apache.org by zhengruifeng <gi...@git.apache.org> on 2018/06/14 07:06:38 UTC

[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

GitHub user zhengruifeng opened a pull request:

    https://github.com/apache/spark/pull/21561

    [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/NB

    ## What changes were proposed in this pull request?
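    Log the number of training examples (`logNumExamples`) in KMeans, BisectingKMeans (BiKM), GaussianMixture (GMM), AFTSurvivalRegression (AFT), and NaiveBayes (NB).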
    
    ## How was this patch tested?
    existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark alg_logNumExamples

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21561.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21561
    
----
commit ec2171d77456554961558028c654293bea159cc7
Author: 郑瑞峰 <zh...@...>
Date:   2018-06-14T06:15:27Z

    init pr

commit 6ec59d2c2f61ebf05136660388b6887c9d452aca
Author: 郑瑞峰 <zh...@...>
Date:   2018-06-14T06:50:42Z

    add bikm

commit 61b95a35ecea4ae21e95fb8370bc4b4525370435
Author: 郑瑞峰 <zh...@...>
Date:   2018-06-14T07:00:12Z

    _

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #93864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93864/testReport)** for PR 21561 at commit [`96e8425`](https://github.com/apache/spark/commit/96e842558dc4005884f335a9a0a03ba02a852db0).


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209860695
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -171,6 +169,8 @@ class BisectingKMeans private (
         val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
         var assignments = vectors.map(v => (ROOT_INDEX, v))
         var activeClusters = summarize(d, assignments, dMeasure)
    +    val numSamples = activeClusters.values.map(_.size).sum
    --- End diff --
    
    ditto


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94781/
    Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/127/
    Test PASSed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209541917
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -151,13 +152,9 @@ class BisectingKMeans private (
         this
       }
     
    -  /**
    -   * Runs the bisecting k-means algorithm.
    -   * @param input RDD of vectors
    -   * @return model for the bisecting kmeans
    -   */
    -  @Since("1.6.0")
    -  def run(input: RDD[Vector]): BisectingKMeansModel = {
    +
    +  private[spark] def run(input: RDD[Vector],
    +                         instr: Option[Instrumentation]): BisectingKMeansModel = {
    --- End diff --
    
    nit: indentation


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #94669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94669/testReport)** for PR 21561 at commit [`fb3ff2b`](https://github.com/apache/spark/commit/fb3ff2b1f6cfa3936f2aa3901be844555d33887e).


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21561


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #93865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93865/testReport)** for PR 21561 at commit [`2e48282`](https://github.com/apache/spark/commit/2e48282825a6fb46a50f4497491c550963f2c634).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #91823 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91823/testReport)** for PR 21561 at commit [`61b95a3`](https://github.com/apache/spark/commit/61b95a35ecea4ae21e95fb8370bc4b4525370435).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93864/
    Test FAILed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209860247
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -299,7 +299,7 @@ class KMeans private (
           val bcCenters = sc.broadcast(centers)
     
           // Find the new centers
    -      val newCenters = data.mapPartitions { points =>
    +      val collected = data.mapPartitions { points =>
    --- End diff --
    
    nit: can we find a better name than `collected`?


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r210467653
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -246,6 +245,16 @@ class BisectingKMeans private (
         new BisectingKMeansModel(root, this.distanceMeasure)
       }
     
    +  /**
    +   * Runs the bisecting k-means algorithm.
    +   * @param input RDD of vectors
    +   * @return model for the bisecting kmeans
    +   */
    +  @Since("1.6.0")
    --- End diff --
    
    This API has existed since 1.6.0, so shouldn't we keep the `@Since("1.6.0")` annotation?


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209496789
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -157,11 +157,15 @@ class NaiveBayes @Since("1.5.0") (
         instr.logNumFeatures(numFeatures)
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
     
    +    val countAccum = dataset.sparkSession.sparkContext.longAccumulator
    +
         // Aggregates term frequencies per label.
         // TODO: Calling aggregateByKey and collect creates two stages, we can implement something
         // TODO: similar to reduceByKeyLocally to save one stage.
         val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd
    -      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      .map { row =>
    +        countAccum.add(1L)
    --- End diff --
    
    This should work correctly; however, to guarantee correctness, I have updated the PR to compute the number without an accumulator.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r210158840
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -299,7 +299,7 @@ class KMeans private (
           val bcCenters = sc.broadcast(centers)
     
           // Find the new centers
    -      val newCenters = data.mapPartitions { points =>
    +      val collected = data.mapPartitions { points =>
    --- End diff --
    
    I am neutral on this.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1547/
    Test PASSed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r210468107
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -246,6 +245,16 @@ class BisectingKMeans private (
         new BisectingKMeansModel(root, this.distanceMeasure)
       }
     
    +  /**
    +   * Runs the bisecting k-means algorithm.
    +   * @param input RDD of vectors
    +   * @return model for the bisecting kmeans
    +   */
    +  @Since("1.6.0")
    --- End diff --
    
    You couldn't call `BisectingKMeans.run(...)` before this, right? It wasn't in a superclass or anything. In that sense, I think this method needs to be marked as new as of 2.4.0, right?


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #94669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94669/testReport)** for PR 21561 at commit [`fb3ff2b`](https://github.com/apache/spark/commit/fb3ff2b1f6cfa3936f2aa3901be844555d33887e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #94718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94718/testReport)** for PR 21561 at commit [`5f403fa`](https://github.com/apache/spark/commit/5f403faf1680be7acc5caca5931bf8bc1447bfcb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged to master


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209860657
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -317,7 +317,14 @@ class KMeans private (
           }.reduceByKey { case ((sum1, count1), (sum2, count2)) =>
             axpy(1.0, sum2, sum1)
             (sum1, count1 + count2)
    -      }.collectAsMap().mapValues { case (sum, count) =>
    +      }.collectAsMap()
    +
    +      if (iteration == 0) {
    +        val numSamples = collected.values.map(_._2).sum
    --- End diff --
    
    What about moving this into the `foreach`, so it is computed only if needed?
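
A minimal sketch of this suggestion, using plain Scala collections rather than Spark. `FakeInstrumentation` and `logOnFirstIteration` are hypothetical names introduced for illustration, and the `collected` map is assumed to have the "cluster index -> (sum, count)" shape visible in the diff. Placing the sum inside `Option.foreach` means it is evaluated only when an instrumentation object is actually present.

```scala
object LazyNumSamplesSketch {
  // Hypothetical stand-in for Spark's Instrumentation; it only records the last value logged.
  final class FakeInstrumentation {
    var logged: Option[Long] = None
    def logNumExamples(num: Long): Unit = { logged = Some(num) }
  }

  // Assumed shape of the per-cluster aggregate: cluster index -> (coordinate sum, point count).
  type Collected = Map[Int, (Array[Double], Long)]

  def logOnFirstIteration(
      iteration: Int,
      collected: Collected,
      instr: Option[FakeInstrumentation]): Unit = {
    if (iteration == 0) {
      // The sum runs only if `instr` is defined, so nothing is computed when logging is off.
      instr.foreach { i =>
        val numSamples = collected.values.map(_._2).sum
        i.logNumExamples(numSamples)
      }
    }
  }
}
```

With `instr = None` the body of the `foreach` never executes, so the counts are not summed at all.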


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #93866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93866/testReport)** for PR 21561 at commit [`1a93c34`](https://github.com/apache/spark/commit/1a93c3432f95713e9a086a39e2f605ea4953619a).


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93866/
    Test PASSed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r210468884
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -246,6 +245,16 @@ class BisectingKMeans private (
         new BisectingKMeansModel(root, this.distanceMeasure)
       }
     
    +  /**
    +   * Runs the bisecting k-means algorithm.
    +   * @param input RDD of vectors
    +   * @return model for the bisecting kmeans
    +   */
    +  @Since("1.6.0")
    --- End diff --
    
    Oh right, I get it now: this isn't a new method; it's 'replacing' the definition above. 👍


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r210373934
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -246,6 +245,16 @@ class BisectingKMeans private (
         new BisectingKMeansModel(root, this.distanceMeasure)
       }
     
    +  /**
    +   * Runs the bisecting k-means algorithm.
    +   * @param input RDD of vectors
    +   * @return model for the bisecting kmeans
    +   */
    +  @Since("1.6.0")
    --- End diff --
    
    Nit: this should be since 2.4?


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #91823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91823/testReport)** for PR 21561 at commit [`61b95a3`](https://github.com/apache/spark/commit/61b95a35ecea4ae21e95fb8370bc4b4525370435).


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209256004
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -157,11 +157,15 @@ class NaiveBayes @Since("1.5.0") (
         instr.logNumFeatures(numFeatures)
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
     
    +    val countAccum = dataset.sparkSession.sparkContext.longAccumulator
    +
         // Aggregates term frequencies per label.
         // TODO: Calling aggregateByKey and collect creates two stages, we can implement something
         // TODO: similar to reduceByKeyLocally to save one stage.
         val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd
    -      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      .map { row =>
    +        countAccum.add(1L)
    --- End diff --
    
    Is this guaranteed to work correctly, given that this is in a map operation? I'm wondering whether this introduces a correctness issue, or whether this number is available elsewhere.
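
For context, Spark's programming guide notes that accumulator updates made inside transformations (such as `map`) may be applied more than once if tasks or stages are re-executed. The sketch below is only an analogy in plain Scala, not Spark code: a lazy view that is traversed twice stands in for a re-executed task, and `countWithSideEffect` and `countFromAggregate` are hypothetical helper names. It contrasts counting through a side effect with deriving the count from the aggregated result.

```scala
object AccumulatorSketch {
  // Counts rows via a side effect inside a lazy `map`. `evaluations` is the number of
  // times the mapped view is traversed, standing in for task re-execution.
  def countWithSideEffect(rows: Seq[(Double, Double)], evaluations: Int): Long = {
    var count = 0L
    val mapped = rows.view.map { row =>
      count += 1L
      row
    }
    (1 to evaluations).foreach(_ => mapped.foreach(_ => ()))
    count
  }

  // Derives the count from a per-label aggregate, so re-evaluation cannot inflate it.
  def countFromAggregate(rows: Seq[(Double, Double)]): Long = {
    val perLabel: Map[Double, Long] =
      rows.groupBy(_._1).map { case (label, group) => (label, group.size.toLong) }
    perLabel.values.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical (label, weight) rows.
    val rows = Seq((0.0, 1.0), (1.0, 1.0), (0.0, 1.0))
    println(countWithSideEffect(rows, evaluations = 2)) // 6: each of the 3 rows is counted twice
    println(countFromAggregate(rows))                   // 3
  }
}
```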


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209256325
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -151,13 +152,9 @@ class BisectingKMeans private (
         this
       }
     
    -  /**
    -   * Runs the bisecting k-means algorithm.
    -   * @param input RDD of vectors
    -   * @return model for the bisecting kmeans
    -   */
    -  @Since("1.6.0")
    -  def run(input: RDD[Vector]): BisectingKMeansModel = {
    +
    +  private[spark] def run(input: RDD[Vector],
    +                         instr: Option[Instrumentation]): BisectingKMeansModel = {
    --- End diff --
    
    Elsewhere I see the instrumentation made available with `instrumented` -- is this different?


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #94718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94718/testReport)** for PR 21561 at commit [`5f403fa`](https://github.com/apache/spark/commit/5f403faf1680be7acc5caca5931bf8bc1447bfcb).


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93865/
    Test FAILed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4016/
    Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #93865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93865/testReport)** for PR 21561 at commit [`2e48282`](https://github.com/apache/spark/commit/2e48282825a6fb46a50f4497491c550963f2c634).


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2157/
    Test PASSed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r209498032
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -151,13 +152,9 @@ class BisectingKMeans private (
         this
       }
     
    -  /**
    -   * Runs the bisecting k-means algorithm.
    -   * @param input RDD of vectors
    -   * @return model for the bisecting kmeans
    -   */
    -  @Since("1.6.0")
    -  def run(input: RDD[Vector]): BisectingKMeansModel = {
    +
    +  private[spark] def run(input: RDD[Vector],
    +                         instr: Option[Instrumentation]): BisectingKMeansModel = {
    --- End diff --
    
    `instrumented` creates a new `Instrumentation`, and it is only used in ml.
    When an mllib implementation is called, the `Instrumentation` is passed in as a parameter, as KMeans does (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L362).
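
The pattern described here can be sketched roughly as follows, in plain Scala without any Spark dependency. All names (`FakeInstrumentation`, `FakeModel`, `FakeAlgorithm`, `fitWithInstrumentation`) are hypothetical stand-ins, and `private[InstrOverloadSketch]` plays the role that `private[spark]` plays in the diff: the public `run(input)` keeps its signature and delegates with `None`, while the caller that already owns an instrumentation object passes it through.

```scala
object InstrOverloadSketch {
  // Hypothetical stand-in for Spark's Instrumentation; it just collects log messages.
  final class FakeInstrumentation {
    val messages = scala.collection.mutable.ArrayBuffer.empty[String]
    def logNumExamples(num: Long): Unit = { messages += s"numExamples=$num" }
  }

  final class FakeModel(val numPoints: Long)

  class FakeAlgorithm {
    // Public entry point: unchanged signature, no instrumentation available.
    def run(input: Seq[Array[Double]]): FakeModel = run(input, None)

    // Restricted overload: the caller supplies the instrumentation it already holds.
    private[InstrOverloadSketch] def run(
        input: Seq[Array[Double]],
        instr: Option[FakeInstrumentation]): FakeModel = {
      val numSamples = input.size.toLong
      instr.foreach(_.logNumExamples(numSamples))
      new FakeModel(numSamples)
    }
  }

  // Plays the role of the ml-side estimator, which owns the instrumentation object.
  def fitWithInstrumentation(
      input: Seq[Array[Double]],
      instr: FakeInstrumentation): FakeModel =
    new FakeAlgorithm().run(input, Some(instr))
}
```

Because the one-argument `run` simply forwards to the two-argument overload, existing callers see the same public API as before.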


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #93864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93864/testReport)** for PR 21561 at commit [`96e8425`](https://github.com/apache/spark/commit/96e842558dc4005884f335a9a0a03ba02a852db0).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94718/
    Test PASSed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r210205064
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -151,13 +152,10 @@ class BisectingKMeans private (
         this
       }
     
    -  /**
    -   * Runs the bisecting k-means algorithm.
    -   * @param input RDD of vectors
    -   * @return model for the bisecting kmeans
    -   */
    -  @Since("1.6.0")
    -  def run(input: RDD[Vector]): BisectingKMeansModel = {
    +
    --- End diff --
    
    nit: extra newline


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2200/
    Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1549/
    Test PASSed.


---



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21561#discussion_r210468639
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -246,6 +245,16 @@ class BisectingKMeans private (
         new BisectingKMeansModel(root, this.distanceMeasure)
       }
     
    +  /**
    +   * Runs the bisecting k-means algorithm.
    +   * @param input RDD of vectors
    +   * @return model for the bisecting kmeans
    +   */
    +  @Since("1.6.0")
    --- End diff --
    
    `def run(input: RDD[Vector]): BisectingKMeansModel` has been a public API since 1.6, so users can call it directly.
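
The overload pattern discussed in this thread can be sketched without Spark as below. This is only an illustration under stated assumptions: `Instr`, `Model`, `Clusterer`, and `fitLikeMl` are hypothetical stand-ins (not Spark classes), the "model" is just a mean, and `private[InstrSketch]` plays the role that `private[spark]` plays in the diff.

```scala
object InstrSketch {
  // Hypothetical stand-in for an instrumentation logger; NOT Spark's Instrumentation class.
  final class Instr {
    private val buf = scala.collection.mutable.ArrayBuffer.empty[String]
    def logNumExamples(n: Long): Unit = buf += s"numExamples=$n"
    def messages: List[String] = buf.toList
  }

  // Hypothetical model type: holds a single "center" (the mean of the input).
  final case class Model(centers: Seq[Double])

  class Clusterer {
    // Public entry point: the signature existing callers rely on stays unchanged.
    def run(input: Seq[Double]): Model = run(input, None)

    // Internal overload: a wrapper may pass its own logger; None means "do not log".
    private[InstrSketch] def run(input: Seq[Double], instr: Option[Instr]): Model = {
      instr.foreach(_.logNumExamples(input.size.toLong))
      Model(Seq(if (input.isEmpty) 0.0 else input.sum / input.size))
    }
  }

  // Mimics a higher-level wrapper that owns the logger and hands it down.
  def fitLikeMl(input: Seq[Double]): (Model, List[String]) = {
    val instr = new Instr
    val model = new Clusterer().run(input, Some(instr))
    (model, instr.messages)
  }
}
```

Calling `new InstrSketch.Clusterer().run(data)` takes the public path and logs nothing, while `InstrSketch.fitLikeMl(data)` takes the internal path and records the example count on the logger it created.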


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #93866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93866/testReport)** for PR 21561 at commit [`1a93c34`](https://github.com/apache/spark/commit/1a93c3432f95713e9a086a39e2f605ea4953619a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91823/
    Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #94781 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94781/testReport)** for PR 21561 at commit [`ecab85c`](https://github.com/apache/spark/commit/ecab85c921fbc81865b800be25d533fea8e75fd5).


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    **[Test build #94781 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94781/testReport)** for PR 21561 at commit [`ecab85c`](https://github.com/apache/spark/commit/ecab85c921fbc81865b800be25d533fea8e75fd5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94669/
    Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2115/
    Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1548/
    Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21561
  
    Merged build finished. Test FAILed.


---
