You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ludatabricks <gi...@git.apache.org> on 2018/04/30 17:16:58 UTC

[GitHub] spark pull request #21195: [Spark 23975][ML] Add support of array input for ...

GitHub user ludatabricks opened a pull request:

    https://github.com/apache/spark/pull/21195

    [Spark 23975][ML] Add support of array input for all clustering methods

    ## What changes were proposed in this pull request?
    
    Add support for all of the clustering methods
    
    ## How was this patch tested?
    
    unit tests added
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ludatabricks/spark-1 SPARK-23975-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21195
    
----
commit 31226b4b8e5aa5fc016f61ec86c42683c452a696
Author: Lu WANG <lu...@...>
Date:   2018-04-26T17:46:49Z

    add Array input support for BisectingKMeans

commit 45e6e96e974607ed0526401d0fdbb4f1c8161dd6
Author: Lu WANG <lu...@...>
Date:   2018-04-30T17:14:41Z

    add support of array input for all clustering methods

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r186555425
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala ---
    @@ -256,6 +257,22 @@ class GaussianMixtureSuite extends SparkFunSuite with MLlibTestSparkContext
         val expectedMatrix = GaussianMixture.unpackUpperTriangularMatrix(4, triangularValues)
         assert(symmetricMatrix === expectedMatrix)
       }
    +
    +  test("GaussianMixture with Array input") {
    +    def trainAndComputlogLikelihood(dataset: Dataset[_]): Double = {
    +      val model = new GaussianMixture().setK(k).setMaxIter(1).setSeed(1).fit(dataset)
    +      model.summary.logLikelihood
    +    }
    +
    +    val (newDataset, newDatasetD, newDatasetF) = MLTestingUtils.generateArrayFeatureDataset(dataset)
    +    val trueLikelihood = trainAndComputlogLikelihood(newDataset)
    +    val doubleLikelihood = trainAndComputlogLikelihood(newDatasetD)
    +    val floatLikelihood = trainAndComputlogLikelihood(newDatasetF)
    +
    +    // checking the cost is fine enough as a sanity check
    +    assert(trueLikelihood == doubleLikelihood)
    --- End diff --
    
    minor: should use `===` instead of `==` for assertions, the former gives a better error message. (not necessary to update this PR)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r186556798
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -247,4 +247,21 @@ object MLTestingUtils extends SparkFunSuite {
         }
         models.sliding(2).foreach { case Seq(m1, m2) => modelEquals(m1, m2)}
       }
    +
    +  /**
    +   * Helper function for testing different input types for features. Given a DataFrame, generate
    +   * three output DataFrames: one having vector feature column with float precision, one having
    +   * double array feature column with float precision, and one having float array feature column.
    +   */
    +  def generateArrayFeatureDataset(dataset: Dataset[_]): (Dataset[_], Dataset[_], Dataset[_]) = {
    +    val toFloatVectorUDF = udf { (features: Vector) => features.toArray.map(_.toFloat).toVector}
    +    val toDoubleArrayUDF = udf { (features: Vector) => features.toArray}
    +    val toFloatArrayUDF = udf { (features: Vector) => features.toArray.map(_.toFloat)}
    +    val newDataset = dataset.withColumn("features", toFloatVectorUDF(col("features")))
    --- End diff --
    
    minor: maybe useful to define `"features"` as a constant at the beginning of the function


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r185983646
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala ---
    @@ -182,6 +184,40 @@ class BisectingKMeansSuite
     
         model.clusterCenters.forall(Vectors.norm(_, 2) == 1.0)
       }
    +
    +  test("BisectingKMeans with Array input") {
    +    val featuresColNameD = "array_double_features"
    +    val featuresColNameF = "array_float_features"
    +    val doubleUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Double](features.size)(0.0)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val floatUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Float](features.size)(0.0f)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
    +      .drop("features")
    --- End diff --
    
    * Unnecessary to drop `features`. Or you can simply replace the features column:
    
    ~~~scala
    val newdatasetD = dataset.withColumn(FEATURES, doubleUDF(col(FEATURES)))
    ~~~


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89981/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #89981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89981/testReport)** for PR 21195 at commit [`45e6e96`](https://github.com/apache/spark/commit/45e6e96e974607ed0526401d0fdbb4f1c8161dd6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r186555908
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
    @@ -323,4 +324,21 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
           assert(model.getOptimizer === optimizer)
         }
       }
    +
    +  test("LDA with Array input") {
    +    def trainAndLogLikelihoodAndPerplexity(dataset: Dataset[_]): (Double, Double) = {
    +      val model = new LDA().setK(k).setOptimizer("online").setMaxIter(1).setSeed(1).fit(dataset)
    +      (model.logLikelihood(dataset), model.logPerplexity(dataset))
    +    }
    +
    +    val (newDataset, newDatasetD, newDatasetF) = MLTestingUtils.generateArrayFeatureDataset(dataset)
    +    val (ll, lp) = trainAndLogLikelihoodAndPerplexity(newDataset)
    --- End diff --
    
    minor: the output are not used. I expect they will be used once we fixed SPARK-22210


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #89981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89981/testReport)** for PR 21195 at commit [`45e6e96`](https://github.com/apache/spark/commit/45e6e96e974607ed0526401d0fdbb4f1c8161dd6).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90335 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90335/testReport)** for PR 21195 at commit [`d065634`](https://github.com/apache/spark/commit/d065634ae70e5f0582eebcba84c90cc06e27e890).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r186556808
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -247,4 +247,21 @@ object MLTestingUtils extends SparkFunSuite {
         }
         models.sliding(2).foreach { case Seq(m1, m2) => modelEquals(m1, m2)}
       }
    +
    +  /**
    +   * Helper function for testing different input types for features. Given a DataFrame, generate
    +   * three output DataFrames: one having vector feature column with float precision, one having
    +   * double array feature column with float precision, and one having float array feature column.
    +   */
    +  def generateArrayFeatureDataset(dataset: Dataset[_]): (Dataset[_], Dataset[_], Dataset[_]) = {
    +    val toFloatVectorUDF = udf { (features: Vector) => features.toArray.map(_.toFloat).toVector}
    +    val toDoubleArrayUDF = udf { (features: Vector) => features.toArray}
    +    val toFloatArrayUDF = udf { (features: Vector) => features.toArray.map(_.toFloat)}
    +    val newDataset = dataset.withColumn("features", toFloatVectorUDF(col("features")))
    +    val newDatasetD = dataset.withColumn("features", toDoubleArrayUDF(col("features")))
    --- End diff --
    
    This doesn't truncate the precision to single. Did you want to use `newDataset` instead of `dataset`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90344/testReport)** for PR 21195 at commit [`c9f478e`](https://github.com/apache/spark/commit/c9f478e990da34819828efc26c11149909e95ccc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90344/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r185984500
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala ---
    @@ -182,6 +184,40 @@ class BisectingKMeansSuite
     
         model.clusterCenters.forall(Vectors.norm(_, 2) == 1.0)
       }
    +
    +  test("BisectingKMeans with Array input") {
    +    val featuresColNameD = "array_double_features"
    +    val featuresColNameF = "array_float_features"
    +    val doubleUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Double](features.size)(0.0)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val floatUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Float](features.size)(0.0f)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
    +      .drop("features")
    +    val newdatasetF = dataset.withColumn(featuresColNameF, floatUDF(col("features")))
    +      .drop("features")
    +    assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false)))
    +    assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false)))
    +
    +    val bkmD = new BisectingKMeans()
    +      .setK(k).setMaxIter(1).setFeaturesCol(featuresColNameD).setSeed(1)
    +    val bkmF = new BisectingKMeans()
    +      .setK(k).setMaxIter(1).setFeaturesCol(featuresColNameF).setSeed(1)
    +    val modelD = bkmD.fit(newdatasetD)
    +    val modelF = bkmF.fit(newdatasetF)
    +    val transformedD = modelD.transform(newdatasetD)
    +    val transformedF = modelF.transform(newdatasetF)
    +    val predictDifference = transformedD.select("prediction")
    +      .except(transformedF.select("prediction"))
    +    assert(predictDifference.count() == 0)
    --- End diff --
    
    This only verifies it handles `Array[Double]` and `Array[Float]` the same way. But it doesn't guarantee that the result is correct. We can define a method that takes a dataset, apply one iteration, and return the cost.
    
    ~~~scala
    def trainAndComputeCost(dataset: DataFrame): Double = {
      val model = new BisectingKMeans()
        .setK(k).setMaxIter(1).setSeed(1)
        .fit(dataset)
      model.computeCost(dataset)
    }
    
    val trueCost = trainAndComputeCost(dataset)
    val floatArrayCost = trainAndComputeCost(newDatasetF)
    assert(floatArrayCost === trueCost)
    val doubleArrayCost = trainAndComputeCost(newDatasetD)
    assert(doubleArrayCost === trueCost)
    ~~~
    
    We can map the original dataset to single precision to have exact match. Or we can test equality with a threshold. See https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r185971385
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala ---
    @@ -101,4 +102,17 @@ private[spark] object SchemaUtils {
         require(!schema.fieldNames.contains(col.name), s"Column ${col.name} already exists.")
         StructType(schema.fields :+ col)
       }
    +
    +  /**
    +   * Check whether the given column in the schema is one of the supporting vector type: Vector,
    +   * Array[Dloat]. Array[Double]
    --- End diff --
    
    nit: Float


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90344/testReport)** for PR 21195 at commit [`c9f478e`](https://github.com/apache/spark/commit/c9f478e990da34819828efc26c11149909e95ccc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #4166 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4166/testReport)** for PR 21195 at commit [`45e6e96`](https://github.com/apache/spark/commit/45e6e96e974607ed0526401d0fdbb4f1c8161dd6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Rerunning tests in case the R CRAN failure was from flakiness


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    LGTM. Merged into master. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21195


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90233 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90233/testReport)** for PR 21195 at commit [`c7a14bb`](https://github.com/apache/spark/commit/c7a14bb14132ce8de084192b02fcdaa1922986a9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90233 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90233/testReport)** for PR 21195 at commit [`c7a14bb`](https://github.com/apache/spark/commit/c7a14bb14132ce8de084192b02fcdaa1922986a9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r185971647
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala ---
    @@ -182,6 +184,40 @@ class BisectingKMeansSuite
     
         model.clusterCenters.forall(Vectors.norm(_, 2) == 1.0)
       }
    +
    +  test("BisectingKMeans with Array input") {
    +    val featuresColNameD = "array_double_features"
    +    val featuresColNameF = "array_float_features"
    +    val doubleUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Double](features.size)(0.0)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    --- End diff --
    
    * If `.toFloat` is to keep the same precision, we should leave an inline comment.
    * `features.toArray.map(_.toFloat.toDouble)` should do the work.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Thanks Lu!
    
    I had a pass over this PR and it looks pretty straightforward. One thing I noticed is that there are two patterns that we keep repeating. I think we should add private APIs for these patterns and delegate to those.
    
    The first pattern is the validate schema method defined in terms of typeCandidates. I suggest we add something like `validateVectorCompatibleColumn` to `DatasetUtils`. In addition to helping with code reuse, this api would make it easier if we ever decide, for example, to support Arrays of Ints.
    
    The second pattern is going from a dataframe & column name to an rdd[OldVector]. Lets add a method that does this, maybe something like `(DataFrame, String) => RDD[OldVector]`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by ludatabricks <gi...@git.apache.org>.

Github user ludatabricks commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r186566521
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
    @@ -323,4 +324,21 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
           assert(model.getOptimizer === optimizer)
         }
       }
    +
    +  test("LDA with Array input") {
    +    def trainAndLogLikelihoodAndPerplexity(dataset: Dataset[_]): (Double, Double) = {
    +      val model = new LDA().setK(k).setOptimizer("online").setMaxIter(1).setSeed(1).fit(dataset)
    +      (model.logLikelihood(dataset), model.logPerplexity(dataset))
    +    }
    +
    +    val (newDataset, newDatasetD, newDatasetF) = MLTestingUtils.generateArrayFeatureDataset(dataset)
    +    val (ll, lp) = trainAndLogLikelihoodAndPerplexity(newDataset)
    --- End diff --
    
    Yes. I want to use this as the base for the comparison after we fix SPARK-22210.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #4166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4166/testReport)** for PR 21195 at commit [`45e6e96`](https://github.com/apache/spark/commit/45e6e96e974607ed0526401d0fdbb4f1c8161dd6).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90160/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r186556119
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -247,4 +247,21 @@ object MLTestingUtils extends SparkFunSuite {
         }
         models.sliding(2).foreach { case Seq(m1, m2) => modelEquals(m1, m2)}
       }
    +
    +  /**
    +   * Helper function for testing different input types for features. Given a DataFrame, generate
    +   * three output DataFrames: one having vector feature column with float precision, one having
    --- End diff --
    
    minor: should say `features` column to make the contract clear.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by MrBago <gi...@git.apache.org>.

Github user MrBago commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Looking now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90335 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90335/testReport)** for PR 21195 at commit [`d065634`](https://github.com/apache/spark/commit/d065634ae70e5f0582eebcba84c90cc06e27e890).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90335/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90233/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    add to whitelist


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r185984894
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
    @@ -323,4 +324,44 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
           assert(model.getOptimizer === optimizer)
         }
       }
    +
    +  test("LDA with Array input") {
    +    val featuresColNameD = "array_double_features"
    +    val featuresColNameF = "array_float_features"
    +    val doubleUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Double](features.size)(0.0)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val floatUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Float](features.size)(0.0f)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
    +      .drop("features")
    +    val newdatasetF = dataset.withColumn(featuresColNameF, floatUDF(col("features")))
    +      .drop("features")
    +    assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false)))
    +    assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false)))
    +
    +    val ldaD = new LDA().setK(k).setOptimizer("online")
    +      .setMaxIter(1).setFeaturesCol(featuresColNameD).setSeed(1)
    +    val ldaF = new LDA().setK(k).setOptimizer("online").
    +      setMaxIter(1).setFeaturesCol(featuresColNameF).setSeed(1)
    +    val modelD = ldaD.fit(newdatasetD)
    +    val modelF = ldaF.fit(newdatasetF)
    +
    +    // logLikelihood, logPerplexity
    +    val llD = modelD.logLikelihood(newdatasetD)
    +    val llF = modelF.logLikelihood(newdatasetF)
    +    // assert(llD == llF)
    +    assert(llD <= 0.0 && llD != Double.NegativeInfinity)
    +    assert(llF <= 0.0 && llF != Double.NegativeInfinity)
    +    val lpD = modelD.logPerplexity(newdatasetD)
    +    val lpF = modelF.logPerplexity(newdatasetF)
    +    // assert(lpD == lpF)
    +    assert(lpD >= 0.0 && lpD != Double.NegativeInfinity)
    +    assert(lpF >= 0.0 && lpF != Double.NegativeInfinity)
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21195#discussion_r185984527
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala ---
    @@ -256,6 +258,42 @@ class GaussianMixtureSuite extends SparkFunSuite with MLlibTestSparkContext
         val expectedMatrix = GaussianMixture.unpackUpperTriangularMatrix(4, triangularValues)
         assert(symmetricMatrix === expectedMatrix)
       }
    +
    +  test("GaussianMixture with Array input") {
    +    val featuresColNameD = "array_double_features"
    +    val featuresColNameF = "array_float_features"
    +    val doubleUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Double](features.size)(0.0)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val floatUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Float](features.size)(0.0f)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
    +      .drop("features")
    +    val newdatasetF = dataset.withColumn(featuresColNameF, floatUDF(col("features")))
    +      .drop("features")
    +    assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false)))
    +    assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false)))
    +
    +    val gmD = new GaussianMixture().setK(k).setMaxIter(1)
    +      .setFeaturesCol(featuresColNameD).setSeed(1)
    +    val gmF = new GaussianMixture().setK(k).setMaxIter(1)
    +      .setFeaturesCol(featuresColNameF).setSeed(1)
    +    val modelD = gmD.fit(newdatasetD)
    +    val modelF = gmF.fit(newdatasetF)
    +    val transformedD = modelD.transform(newdatasetD)
    +    val transformedF = modelF.transform(newdatasetF)
    +    val predictDifference = transformedD.select("prediction")
    +      .except(transformedF.select("prediction"))
    +    assert(predictDifference.count() == 0)
    +    val probabilityDifference = transformedD.select("probability")
    +      .except(transformedF.select("probability"))
    +    assert(probabilityDifference.count() == 0)
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90160/testReport)** for PR 21195 at commit [`877c126`](https://github.com/apache/spark/commit/877c126ff493e43edb5a8bcf33e7dd1fe59503b0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21195
  
    **[Test build #90160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90160/testReport)** for PR 21195 at commit [`877c126`](https://github.com/apache/spark/commit/877c126ff493e43edb5a8bcf33e7dd1fe59503b0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org