You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ludatabricks <gi...@git.apache.org> on 2018/04/16 18:00:45 UTC

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

GitHub user ludatabricks opened a pull request:

    https://github.com/apache/spark/pull/21081

    [SPARK-23975][ML]Allow Clustering to take Arrays of Double as input features

    ## What changes were proposed in this pull request?
    
    - Multiple possible input types is added in validateAndTransformSchema() and computeCost() while checking column type
    
    - Add if statement in transform() to support array type as featuresCol
    
    - Add the case statement in fit() while selecting columns from dataset
    
    These changes will be applied to KMeans first, then to other clustering method
    
    ## How was this patch tested?
    
    unit test is added
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ludatabricks/spark-1 SPARK-23975

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21081.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21081
    
----
commit ed890d35ff1e9edbe2a557f68732835b3e911906
Author: Lu WANG <lu...@...>
Date:   2018-04-16T17:32:02Z

    add Array input support for KMeans

commit badb0cc5ca6ca69bb8e8fc0fce5ea05a4100bca0
Author: Lu WANG <lu...@...>
Date:   2018-04-16T17:49:00Z

    remove redundent code

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89417/testReport)** for PR 21081 at commit [`6d222a3`](https://github.com/apache/spark/commit/6d222a3f257c850e653c9c048fb8c15e44d2c48f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181841557
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         assert(e.getCause.getMessage.contains("Cosine distance is not defined"))
       }
     
    +  test("KMean with Array input") {
    +    val featuresColName = "array_model_features"
    +
    +    val arrayUDF = udf { (features: Vector) =>
    +      features.toArray
    +    }
    +    val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) )
    +
    +    val kmeans = new KMeans()
    +      .setFeaturesCol(featuresColName)
    +
    +    assert(kmeans.getK === 2)
    +    assert(kmeans.getFeaturesCol === featuresColName)
    +    assert(kmeans.getPredictionCol === "prediction")
    +    assert(kmeans.getMaxIter === 20)
    +    assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL)
    +    assert(kmeans.getInitSteps === 2)
    +    assert(kmeans.getTol === 1e-4)
    +    assert(kmeans.getDistanceMeasure === DistanceMeasure.EUCLIDEAN)
    +    val model = kmeans.setMaxIter(1).fit(newdataset)
    +
    +    MLTestingUtils.checkCopyAndUids(kmeans, model)
    --- End diff --
    
    ditto for hasSummary and copying


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21081


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183555957
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -86,13 +86,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
       @Since("1.5.0")
       def getInitSteps: Int = $(initSteps)
     
    +  /**
    +   * Validates the input schema.
    +   * @param schema input schema
    +   */
    +  private[clustering] def validateSchema(schema: StructType): Unit = {
    +    val typeCandidates = List( new VectorUDT,
    +      new ArrayType(DoubleType, false),
    +      new ArrayType(FloatType, false))
    +    SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
    +  }
       /**
    --- End diff --
    
    scala style: always put newline between methods


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89410/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89585/testReport)** for PR 21081 at commit [`009b918`](https://github.com/apache/spark/commit/009b918c8734b19f9f9b34a31c23d6ad582c7465).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    @jkbradley Will this be applied to other algos besides clustering algos ? and how to support sparse float features ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181840765
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         assert(e.getCause.getMessage.contains("Cosine distance is not defined"))
       }
     
    +  test("KMean with Array input") {
    +    val featuresColName = "array_model_features"
    +
    +    val arrayUDF = udf { (features: Vector) =>
    +      features.toArray
    +    }
    +    val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) )
    +
    +    val kmeans = new KMeans()
    +      .setFeaturesCol(featuresColName)
    +
    +    assert(kmeans.getK === 2)
    +    assert(kmeans.getFeaturesCol === featuresColName)
    +    assert(kmeans.getPredictionCol === "prediction")
    --- End diff --
    
    No need to check this or the other Params which are not relevant to this test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182925299
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -305,15 +344,45 @@ class KMeans @Since("1.5.0") (
       @Since("1.5.0")
       def setSeed(value: Long): this.type = set(seed, value)
     
    +  @Since("2.4.0")
    +  def featureToVector(dataset: Dataset[_], col: Column): Column = {
    --- End diff --
    
    Is this a copy of the same method?  It should be shared, either in KMeansParams or in a static (object) method.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183556811
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala ---
    @@ -0,0 +1,54 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.util
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.sql.{Column, Dataset}
    +import org.apache.spark.sql.functions.{col, udf}
    +import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}
    +
    +
    +private[spark] object DatasetUtils {
    +
    +  /**
    +   * preprocessing the input feature column to Vector
    --- End diff --
    
    This is a bit unclear.  How about: "Cast a column in a Dataset to a Vector type."
    Also, this isn't specific to features, so please clarify that below.
    Finally, the key thing to document is the list of supported input types, so I'd add that.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89743/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89417/testReport)** for PR 21081 at commit [`6d222a3`](https://github.com/apache/spark/commit/6d222a3f257c850e653c9c048fb8c15e44d2c48f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182216309
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -90,7 +90,12 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
        * @return output schema
        */
       protected def validateAndTransformSchema(schema: StructType): StructType = {
    -    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
    +    val typeCandidates = List( new VectorUDT,
    +      new ArrayType(DoubleType, true),
    --- End diff --
    
    Thinking about this, let's actually disallow nullable columns.  KMeans won't handle nulls properly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182215434
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,7 +128,21 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    +    // val predictUDF = udf((vector: Vector) => predict(vector))
    +    val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
    +      udf((vector: Vector) => predict(vector))
    +    }
    --- End diff --
    
    Scala style: } else {


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89738/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89749/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182269723
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,7 +128,21 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    +    // val predictUDF = udf((vector: Vector) => predict(vector))
    +    val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
    +      udf((vector: Vector) => predict(vector))
    +    }
    +    else {
    +      udf((vector: Seq[_]) => {
    +        val featureArray = Array.fill[Double](vector.size)(0.0)
    --- End diff --
    
    Here's what I meant:
    ```
        val predictUDF = featuresDataType match {
          case _: VectorUDT =>
            udf((vector: Vector) => predict(vector))
          case fdt: ArrayType => fdt.elementType match {
            case _: FloatType =>
              ???
            case _: DoubleType =>
              ???
          }
        }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183796459
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -86,13 +86,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
       @Since("1.5.0")
       def getInitSteps: Int = $(initSteps)
     
    +  /**
    +   * Validates the input schema.
    +   * @param schema input schema
    +   */
    +  private[clustering] def validateSchema(schema: StructType): Unit = {
    +    val typeCandidates = List( new VectorUDT,
    +      new ArrayType(DoubleType, false),
    +      new ArrayType(FloatType, false))
    +    SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
    +  }
       /**
    --- End diff --
    
    Ping: There needs to be a newline between the "}" of the previous method and the "/**" Scaladoc of the next method.  Please start checking for this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89743 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89743/testReport)** for PR 21081 at commit [`3e012fb`](https://github.com/apache/spark/commit/3e012fba3470c0f938e33cd3c783dff5ee068fcf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182925491
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -144,8 +168,23 @@ class KMeansModel private[ml] (
       // TODO: Replace the temp fix when we have proper evaluators defined for clustering.
       @Since("2.0.0")
       def computeCost(dataset: Dataset[_]): Double = {
    -    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    -    val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map {
    +    val typeCandidates = List( new VectorUDT,
    --- End diff --
    
    You can reuse validateAndTransformSchema here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89738/testReport)** for PR 21081 at commit [`fee36ad`](https://github.com/apache/spark/commit/fee36ad2deaa013c8fa61cae399a274cd7681fa9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89410/testReport)** for PR 21081 at commit [`badb0cc`](https://github.com/apache/spark/commit/badb0cc5ca6ca69bb8e8fc0fce5ea05a4100bca0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89585/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181841713
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         assert(e.getCause.getMessage.contains("Cosine distance is not defined"))
       }
     
    +  test("KMean with Array input") {
    --- End diff --
    
    This should check transform and computeCost too


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181841503
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         assert(e.getCause.getMessage.contains("Cosine distance is not defined"))
       }
     
    +  test("KMean with Array input") {
    +    val featuresColName = "array_model_features"
    +
    +    val arrayUDF = udf { (features: Vector) =>
    +      features.toArray
    +    }
    +    val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) )
    +
    +    val kmeans = new KMeans()
    +      .setFeaturesCol(featuresColName)
    +
    +    assert(kmeans.getK === 2)
    +    assert(kmeans.getFeaturesCol === featuresColName)
    +    assert(kmeans.getPredictionCol === "prediction")
    +    assert(kmeans.getMaxIter === 20)
    +    assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL)
    +    assert(kmeans.getInitSteps === 2)
    +    assert(kmeans.getTol === 1e-4)
    +    assert(kmeans.getDistanceMeasure === DistanceMeasure.EUCLIDEAN)
    +    val model = kmeans.setMaxIter(1).fit(newdataset)
    +
    +    MLTestingUtils.checkCopyAndUids(kmeans, model)
    --- End diff --
    
    You don't need this test here


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89660/testReport)** for PR 21081 at commit [`cd988c7`](https://github.com/apache/spark/commit/cd988c7f2e4c2cb1f2006e264a24529a72c9a5cf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89585/testReport)** for PR 21081 at commit [`009b918`](https://github.com/apache/spark/commit/009b918c8734b19f9f9b34a31c23d6ad582c7465).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183558056
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -199,6 +201,47 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         assert(e.getCause.getMessage.contains("Cosine distance is not defined"))
       }
     
    +  test("KMean with Array input") {
    +    val featuresColNameD = "array_double_features"
    +    val featuresColNameF = "array_float_features"
    +
    +    val doubleUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Double](features.size)(0.0)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val floatUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Float](features.size)(0.0f)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +
    +    val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
    +      .drop("features")
    +    val newdatasetF = dataset.withColumn(featuresColNameF, floatUDF(col("features")))
    +      .drop("features")
    +
    +    assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false)))
    +    assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false)))
    +
    +    val kmeansD = new KMeans().setK(k).setFeaturesCol(featuresColNameD).setSeed(1)
    --- End diff --
    
    Also do: `setMaxIter(1)` to make this a little faster.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89743 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89743/testReport)** for PR 21081 at commit [`3e012fb`](https://github.com/apache/spark/commit/3e012fba3470c0f938e33cd3c783dff5ee068fcf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183797106
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala ---
    @@ -27,28 +26,38 @@ import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}
     private[spark] object DatasetUtils {
     
       /**
    -   * preprocessing the input feature column to Vector
    -   * @param dataset DataFrame with columns for features
    -   * @param colName column name for features
    -   * @return Vector feature column
    +   * Cast a column in a Dataset to Vector type.
    +   *
    +   * The supported data types of the input column are
    +   * - Vector
    +   * - float/double type Array.
    +   *
    +   * Note: The returned column does not have Metadata.
    +   *
    +   * @param dataset input DataFrame
    +   * @param colName column name.
    +   * @return Vector column
        */
    -  @Since("2.4.0")
       def columnToVector(dataset: Dataset[_], colName: String): Column = {
    -    val featuresDataType = dataset.schema(colName).dataType
    -    featuresDataType match {
    +    val columnDataType = dataset.schema(colName).dataType
    +    columnDataType match {
           case _: VectorUDT => col(colName)
           case fdt: ArrayType =>
             val transferUDF = fdt.elementType match {
               case _: FloatType => udf(f = (vector: Seq[Float]) => {
    -            val featureArray = Array.fill[Double](vector.size)(0.0)
    -            vector.indices.foreach(idx => featureArray(idx) = vector(idx).toDouble)
    -            Vectors.dense(featureArray)
    +            val inputArray = Array.fill[Double](vector.size)(0.0)
    +            vector.indices.foreach(idx => inputArray(idx) = vector(idx).toDouble)
    +            Vectors.dense(inputArray)
               })
               case _: DoubleType => udf((vector: Seq[Double]) => {
                 Vectors.dense(vector.toArray)
               })
    +          case other =>
    --- End diff --
    
    Thanks!  I forgot about this since this was generalized.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182924819
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -120,11 +123,32 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  @Since("2.4.0")
    +  def featureToVector(dataset: Dataset[_], col: Column): Column = {
    --- End diff --
    
    Make this private.  In general, we try to keep APIs as private as possible since that allows us more flexibility to make changes in the future.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89410/testReport)** for PR 21081 at commit [`badb0cc`](https://github.com/apache/spark/commit/badb0cc5ca6ca69bb8e8fc0fce5ea05a4100bca0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182217722
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,7 +128,21 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    +    // val predictUDF = udf((vector: Vector) => predict(vector))
    +    val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
    +      udf((vector: Vector) => predict(vector))
    +    }
    +    else {
    +      udf((vector: Seq[_]) => {
    +        val featureArray = Array.fill[Double](vector.size)(0.0)
    --- End diff --
    
    You shouldn't have to do the conversion in this convoluted (and less efficient) way.  I'd recommend doing a match-case statement on dataset.schema; I think that will be the most succinct.  Then you can handle Vector, Seq of Float, and Seq of Double separately, without conversions to strings.
    
    Same for the similar cases below.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181847784
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,8 +128,15 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    -    dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
    +    // val predictUDF = udf((vector: Vector) => predict(vector))
    +    if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
    +      val predictUDF = udf((vector: Vector) => predict(vector))
    +      dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
    +    } else {
    +      val predictUDF = udf((vector: Seq[_]) =>
    +        predict(Vectors.dense(vector.asInstanceOf[Seq[Double]].toArray)))
    --- End diff --
    
    This may not work with arrays of FloatType.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183556424
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala ---
    @@ -0,0 +1,54 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.util
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.sql.{Column, Dataset}
    +import org.apache.spark.sql.functions.{col, udf}
    +import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}
    +
    +
    +private[spark] object DatasetUtils {
    +
    +  /**
    +   * preprocessing the input feature column to Vector
    +   * @param dataset DataFrame with columns for features
    +   * @param colName column name for features
    +   * @return Vector feature column
    +   */
    +  @Since("2.4.0")
    --- End diff --
    
    Don't add Since annotations to private APIs.  They can get Since annotations when they are made public.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181847061
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -144,7 +156,12 @@ class KMeansModel private[ml] (
       // TODO: Replace the temp fix when we have proper evaluators defined for clustering.
       @Since("2.0.0")
       def computeCost(dataset: Dataset[_]): Double = {
    -    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    val typeCandidates = List( new VectorUDT,
    +      new ArrayType(DoubleType, true),
    +      new ArrayType(DoubleType, false),
    +      new ArrayType(FloatType, true),
    +      new ArrayType(FloatType, false))
    +    SchemaUtils.checkColumnTypes(dataset.schema, $(featuresCol), typeCandidates)
         val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map {
    --- End diff --
    
    this won't take non-Vector types though; a unit test would catch this


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183557655
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala ---
    @@ -0,0 +1,54 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.util
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.sql.{Column, Dataset}
    +import org.apache.spark.sql.functions.{col, udf}
    +import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}
    +
    +
    +private[spark] object DatasetUtils {
    +
    +  /**
    +   * preprocessing the input feature column to Vector
    +   * @param dataset DataFrame with columns for features
    +   * @param colName column name for features
    +   * @return Vector feature column
    --- End diff --
    
    Add a note that this returned Column does not have Metadata


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89749 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89749/testReport)** for PR 21081 at commit [`c4e1a51`](https://github.com/apache/spark/commit/c4e1a51551993008d7b082b112d2296cbc4eb97b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    I hope we can apply it to other algs too.  @ludatabricks is doing some refactoring which should make that easier, but we're not going for a completely general approach right away.
    
    I don't think we need to worry about sparse FloatType features; users have no way to pass those in.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182925210
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -120,11 +123,32 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  @Since("2.4.0")
    +  def featureToVector(dataset: Dataset[_], col: Column): Column = {
    --- End diff --
    
    Also, add a Scala docstring saying what this does.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #4157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4157/testReport)** for PR 21081 at commit [`c4e1a51`](https://github.com/apache/spark/commit/c4e1a51551993008d7b082b112d2296cbc4eb97b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182269644
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,7 +128,21 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    +    // val predictUDF = udf((vector: Vector) => predict(vector))
    +    val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
    +      udf((vector: Vector) => predict(vector))
    --- End diff --
    
    Side note: I realized that "predict" will cause the whole model to be serialized and sent to workers.  But that's actually OK since we do need to send most of the model data to make predictions and since there's not a clean way to just sent the model weights.  So I think my previous comment about copying "numClasses" to a local variable was not necessary.  Don't bother reverting the change though.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89660/testReport)** for PR 21081 at commit [`cd988c7`](https://github.com/apache/spark/commit/cd988c7f2e4c2cb1f2006e264a24529a72c9a5cf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89417/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182216415
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -90,7 +90,12 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
        * @return output schema
        */
       protected def validateAndTransformSchema(schema: StructType): StructType = {
    -    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
    +    val typeCandidates = List( new VectorUDT,
    +      new ArrayType(DoubleType, true),
    --- End diff --
    
    Also, IntelliJ may warn you about passing boolean arguments as named arguments; that'd be nice to fix here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89736/testReport)** for PR 21081 at commit [`3ffb322`](https://github.com/apache/spark/commit/3ffb32291503a628c63a7014d27f2313f04c5497).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    @WeichenXu123  A generic vector class would be interesting, but that would be a big project, way out of scope of this PR.  You could bring it up if that person on the dev list sends a SPIP about linear algebra.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89738/testReport)** for PR 21081 at commit [`fee36ad`](https://github.com/apache/spark/commit/fee36ad2deaa013c8fa61cae399a274cd7681fa9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182215639
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,7 +128,21 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    +    // val predictUDF = udf((vector: Vector) => predict(vector))
    +    val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
    +      udf((vector: Vector) => predict(vector))
    +    }
    +    else {
    +      udf((vector: Seq[_]) => {
    --- End diff --
    
    scala style: remove unnecessary ```{``` at end of line (IntelliJ should warn you about this)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r182924903
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -120,11 +123,32 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  @Since("2.4.0")
    +  def featureToVector(dataset: Dataset[_], col: Column): Column = {
    +    val featuresDataType = dataset.schema(getFeaturesCol).dataType
    +    val transferUDF = featuresDataType match {
    +      case _: VectorUDT => udf((vector: Vector) => vector)
    --- End diff --
    
    Just return ```col(getFeaturesCol)``` since that will be more efficient.  (Calling a UDF requires data serialization overhead.)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181846789
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,8 +128,15 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    -    dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
    +    // val predictUDF = udf((vector: Vector) => predict(vector))
    +    if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
    --- End diff --
    
    tip: This can be more succinct if written as:
    ```
    val predictUDF = if (dataset.schema(...).dataType.equals(...)) { A } else { B }
    dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))  // so this line is only written once
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #4157 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4157/testReport)** for PR 21081 at commit [`c4e1a51`](https://github.com/apache/spark/commit/c4e1a51551993008d7b082b112d2296cbc4eb97b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89749/testReport)** for PR 21081 at commit [`c4e1a51`](https://github.com/apache/spark/commit/c4e1a51551993008d7b082b112d2296cbc4eb97b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181847695
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -312,6 +329,8 @@ class KMeans @Since("1.5.0") (
         val handlePersistence = dataset.storageLevel == StorageLevel.NONE
         val instances: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map {
           case Row(point: Vector) => OldVectors.fromML(point)
    +      case Row(point: Seq[_]) =>
    +        OldVectors.fromML(Vectors.dense(point.asInstanceOf[Seq[Double]].toArray))
    --- End diff --
    
    I'm not sure this will work with arrays of FloatType.  Make sure to test it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    So why not design generic vector class ? and then implement Vector[Double] and Vector[Float] via generic specification ? So it can support everything, no matter sparse and dense.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    **[Test build #89736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89736/testReport)** for PR 21081 at commit [`3ffb322`](https://github.com/apache/spark/commit/3ffb32291503a628c63a7014d27f2313f04c5497).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r181840894
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         assert(e.getCause.getMessage.contains("Cosine distance is not defined"))
       }
     
    +  test("KMean with Array input") {
    +    val featuresColName = "array_model_features"
    +
    +    val arrayUDF = udf { (features: Vector) =>
    +      features.toArray
    +    }
    +    val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) )
    --- End diff --
    
    nit: You could drop the original column as well just to make extra sure that it's not being accidentally used.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89660/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21081
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89736/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21081#discussion_r183558105
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -199,6 +201,47 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         assert(e.getCause.getMessage.contains("Cosine distance is not defined"))
       }
     
    +  test("KMean with Array input") {
    +    val featuresColNameD = "array_double_features"
    +    val featuresColNameF = "array_float_features"
    +
    +    val doubleUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Double](features.size)(0.0)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +    val floatUDF = udf { (features: Vector) =>
    +      val featureArray = Array.fill[Float](features.size)(0.0f)
    +      features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
    +      featureArray
    +    }
    +
    +    val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
    +      .drop("features")
    +    val newdatasetF = dataset.withColumn(featuresColNameF, floatUDF(col("features")))
    +      .drop("features")
    +
    +    assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false)))
    +    assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false)))
    +
    +    val kmeansD = new KMeans().setK(k).setFeaturesCol(featuresColNameD).setSeed(1)
    +    val kmeansF = new KMeans().setK(k).setFeaturesCol(featuresColNameF).setSeed(1)
    +    val modelD = kmeansD.fit(newdatasetD)
    +    val modelF = kmeansF.fit(newdatasetF)
    +
    +    val transformedD = modelD.transform(newdatasetD)
    +    val transformedF = modelF.transform(newdatasetF)
    +
    +    val predictDifference = transformedD.select("prediction")
    +      .except(transformedF.select("prediction"))
    +
    +    assert(predictDifference.count() == 0)
    +
    +    assert(modelD.computeCost(newdatasetD) == modelF.computeCost(newdatasetF) )
    +
    --- End diff --
    
    nit: remove unnecessary newline


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org