You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2017/10/31 14:47:34 UTC

[GitHub] spark pull request #19621: [SPARK-11215][ml] Add multiple columns support to...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/19621

    [SPARK-11215][ml] Add multiple columns support to StringIndexer

    ## What changes were proposed in this pull request?
    
    Add multiple columns support to StringIndexer.
    
    ## How was this patch tested?
    
    UT will add soon


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark multi-col-string-indexer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19621.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19621
    
----
commit faa839052ebe0949882b7b8c0ca3557f26beb8e8
Author: WeichenXu <we...@databricks.com>
Date:   2017-10-31T14:45:34Z

    init pr

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @felixcheung Yes, the spark.mlp test result changed because of indexer order changed. That's because, StringIndexer when item frequency equal, there's no definite rule for index order. And, in this PR, I change the code logic in StringIndexer, **but it cannot make sure to generate exactly the same indexer order, because when item frequency equal, there's no definite rule for index order.**


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    You can change the dataset used in testing.
    Will be good if you could test with the same data before and after your change to make sure that’s not broken.
    
    
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84955/testReport)** for PR 19621 at commit [`bb209c8`](https://github.com/apache/spark/commit/bb209c80395cce84466a8fb8e0c58ca151b791ab).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r148172976
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -130,21 +152,33 @@ class StringIndexer @Since("1.4.0") (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val values = dataset.na.drop(Array($(inputCol)))
    -      .select(col($(inputCol)).cast(StringType))
    -      .rdd.map(_.getString(0))
    -    val labels = $(stringOrderType) match {
    -      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
    -      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
    +
    +    val labelsArray = for (inputCol <- getInOutCols._1) yield {
    +      val values = dataset.na.drop(Array(inputCol))
    +        .select(col(inputCol).cast(StringType))
    +        .rdd.map(_.getString(0))
    --- End diff --
    
    This gets the values for each input column sequentially. Can we get the values for all input columns at one run?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Any one can provide some suggestion ? for fixing sparkR glm test failure here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84125/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84093 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84093/testReport)** for PR 19621 at commit [`031f53f`](https://github.com/apache/spark/commit/031f53fbd1c112d8f0b37bb29e847cd3184498c6).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    I checked the failed tests in sparkR. There's some trouble in the failed `glm` sparkR tests.
    These tests compare sparkR glm and R-lib glm results on test data "iris", but, what's the string indexer order for R-lib glm ? I check the dataset "iris", the "Species" column has three value "setosa", "versicolor", "virginica", **their frequency are all 50**, and only when `RFormula` index them as: "setosa"->2, "versicolor"->0, "virginica"->1, the result will be the same with R-lib glm. This is a strange indexer order.
    How to set string indexer order for R-lib glm ?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83265/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @MLnick Will RDD "count by value" aggregation be deterministic ? e.g., 2 RDD with the same elements, but with different element order and different partition number, will `rdd.countByValue().toSeq` keep deterministic ? The shuffling in `countByValue` seems also possible to break determinacy.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @viirya @MLnick Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r148174902
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -130,21 +152,33 @@ class StringIndexer @Since("1.4.0") (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val values = dataset.na.drop(Array($(inputCol)))
    -      .select(col($(inputCol)).cast(StringType))
    -      .rdd.map(_.getString(0))
    -    val labels = $(stringOrderType) match {
    -      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
    -      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
    +
    +    val labelsArray = for (inputCol <- getInOutCols._1) yield {
    +      val values = dataset.na.drop(Array(inputCol))
    +        .select(col(inputCol).cast(StringType))
    +        .rdd.map(_.getString(0))
    --- End diff --
    
    Yes you're right, this is what I am going to do.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84125/testReport)** for PR 19621 at commit [`66d054a`](https://github.com/apache/spark/commit/66d054a7daca8a82fd1022fe05e766e7f7285028).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r152717985
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -130,21 +160,49 @@ class StringIndexer @Since("1.4.0") (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val values = dataset.na.drop(Array($(inputCol)))
    -      .select(col($(inputCol)).cast(StringType))
    -      .rdd.map(_.getString(0))
    -    val labels = $(stringOrderType) match {
    -      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
    -      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
    +
    +    val inputCols = getInOutCols._1
    +
    +    val zeroState = Array.fill(inputCols.length)(new OpenHashMap[String, Long]())
    +
    +    val countByValueArray = dataset.na.drop(inputCols)
    +      .select(inputCols.map(col(_).cast(StringType)): _*)
    +      .rdd.aggregate(zeroState)(
    +      (state: Array[OpenHashMap[String, Long]], row: Row) => {
    +        for (i <- 0 until inputCols.length) {
    +          state(i).changeValue(row.getString(i), 1L, _ + 1)
    +        }
    +        state
    +      },
    +      (state1: Array[OpenHashMap[String, Long]], state2: Array[OpenHashMap[String, Long]]) => {
    +        for (i <- 0 until inputCols.length) {
    +          state2(i).foreach { case (key: String, count: Long) =>
    +            state1(i).changeValue(key, count, _ + count)
    +          }
    +        }
    +        state1
    +      }
    +    )
    +    val labelsArray = countByValueArray.map { countByValue =>
    +      $(stringOrderType) match {
    +        case StringIndexer.frequencyDesc => countByValue.toSeq.sortBy(-_._2).map(_._1).toArray
    +        case StringIndexer.frequencyAsc => countByValue.toSeq.sortBy(_._2).map(_._1).toArray
    +        case StringIndexer.alphabetDesc => countByValue.toSeq.map(_._1).sortWith(_ > _).toArray
    +        case StringIndexer.alphabetAsc => countByValue.toSeq.map(_._1).sortWith(_ < _).toArray
    --- End diff --
    
    If the dataset is large, it might be. We can leave it as it is. If we find it is a bottleneck, we still can revisit it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84955/testReport)** for PR 19621 at commit [`bb209c8`](https://github.com/apache/spark/commit/bb209c80395cce84466a8fb8e0c58ca151b791ab).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    It won't be deterministic in the case of different RDDs / partitions / shuffle etc. For a given input RDD it _should_ be deterministic? 
    
    But perhaps we could ensure it by first sorting alphabetically and then by frequency?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #83878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83878/testReport)** for PR 19621 at commit [`77bea32`](https://github.com/apache/spark/commit/77bea32984b167894be79736f56601a44baaaa99).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r157169491
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -79,20 +80,49 @@ private[feature] trait StringIndexerBase extends Params with HasHandleInvalid wi
       @Since("2.3.0")
       def getStringOrderType: String = $(stringOrderType)
     
    +  private[feature] def getInOutCols: (Array[String], Array[String]) = {
    +
    +    require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) ||
    +      (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)),
    +      "Only allow to set either inputCol/outputCol, or inputCols/outputCols"
    +    )
    +
    +    if (isSet(inputCol)) {
    +      (Array($(inputCol)), Array($(outputCol)))
    +    } else {
    +      require($(inputCols).length == $(outputCols).length,
    --- End diff --
    
    test added.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84125/testReport)** for PR 19621 at commit [`66d054a`](https://github.com/apache/spark/commit/66d054a7daca8a82fd1022fe05e766e7f7285028).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83878/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    I want to ask, for option `StringIndexer.frequencyDesc`, in the case existing two labels which have the same frequency, which of them will be put in the front ?
    If this is not specified, then we should modify some test cases in `RFomular`, which will generate nondeterministic result.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @MLnick Ah, I don't express it exactly, the first case, what I mean is, sort by frequency, but if the case frequency equal, sort by alphabet.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84101/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #83263 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83263/testReport)** for PR 19621 at commit [`faa8390`](https://github.com/apache/spark/commit/faa839052ebe0949882b7b8c0ca3557f26beb8e8).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84066/testReport)** for PR 19621 at commit [`e5db190`](https://github.com/apache/spark/commit/e5db190f9b76e22ea8f665456cba60fd31cc9cf0).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r151338923
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -130,21 +160,49 @@ class StringIndexer @Since("1.4.0") (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val values = dataset.na.drop(Array($(inputCol)))
    -      .select(col($(inputCol)).cast(StringType))
    -      .rdd.map(_.getString(0))
    -    val labels = $(stringOrderType) match {
    -      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
    -      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
    +
    +    val inputCols = getInOutCols._1
    +
    +    val zeroState = Array.fill(inputCols.length)(new OpenHashMap[String, Long]())
    +
    +    val countByValueArray = dataset.na.drop(inputCols)
    +      .select(inputCols.map(col(_).cast(StringType)): _*)
    +      .rdd.aggregate(zeroState)(
    +      (state: Array[OpenHashMap[String, Long]], row: Row) => {
    +        for (i <- 0 until inputCols.length) {
    +          state(i).changeValue(row.getString(i), 1L, _ + 1)
    +        }
    +        state
    +      },
    +      (state1: Array[OpenHashMap[String, Long]], state2: Array[OpenHashMap[String, Long]]) => {
    +        for (i <- 0 until inputCols.length) {
    +          state2(i).foreach { case (key: String, count: Long) =>
    +            state1(i).changeValue(key, count, _ + count)
    +          }
    +        }
    +        state1
    +      }
    +    )
    +    val labelsArray = countByValueArray.map { countByValue =>
    +      $(stringOrderType) match {
    +        case StringIndexer.frequencyDesc => countByValue.toSeq.sortBy(-_._2).map(_._1).toArray
    +        case StringIndexer.frequencyAsc => countByValue.toSeq.sortBy(_._2).map(_._1).toArray
    +        case StringIndexer.alphabetDesc => countByValue.toSeq.map(_._1).sortWith(_ > _).toArray
    +        case StringIndexer.alphabetAsc => countByValue.toSeq.map(_._1).sortWith(_ < _).toArray
    --- End diff --
    
    For `alphabetAsc` and `alphabetDesc`, seems we don't need to aggregate to count by value.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84066/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84955/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84066/testReport)** for PR 19621 at commit [`e5db190`](https://github.com/apache/spark/commit/e5db190f9b76e22ea8f665456cba60fd31cc9cf0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84093 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84093/testReport)** for PR 19621 at commit [`031f53f`](https://github.com/apache/spark/commit/031f53fbd1c112d8f0b37bb29e847cd3184498c6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @viirya Code updated. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @WeichenXu123 with reference to https://github.com/apache/spark/pull/19621#issuecomment-344530228  - the sort is [stable](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L627) with respect to the input collection. So as long as the result of the "count by value" aggregation is deterministic so will the sort order be in the case of equalities. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r152252129
  
    --- Diff: project/MimaExcludes.scala ---
    @@ -82,7 +82,15 @@ object MimaExcludes {
     
         // [SPARK-21087] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala
         ProblemFilters.exclude[FinalClassProblem]("org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelWriter"),
    -    ProblemFilters.exclude[FinalClassProblem]("org.apache.spark.ml.tuning.TrainValidationSplitModel$TrainValidationSplitModelWriter")
    +    ProblemFilters.exclude[FinalClassProblem]("org.apache.spark.ml.tuning.TrainValidationSplitModel$TrainValidationSplitModelWriter"),
    +
    +    // [SPARK-11215][ML] Add multiple columns support to StringIndexer
    +    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema"),
    +    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexerModel.validateAndTransformSchema"),
    +    ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"),
    +    ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.outputCols"),
    +    ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.getOutputCols"),
    +    ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.org$apache$spark$ml$param$shared$HasOutputCols$_setter_$outputCols_=")
    --- End diff --
    
    Do we need to keep binary compatibility for `validateAndTransformSchema` ? Will user extend this class and override this method ?
    Others relates to `outputCols` parameter.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r151339193
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -177,7 +235,8 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
     /**
      * Model fitted by [[StringIndexer]].
      *
    - * @param labels  Ordered list of labels, corresponding to indices to be assigned.
    + * @param labelsArray  Array of Ordered list of labels, corresponding to indices to be assigned
    --- End diff --
    
    `Ordered` -> `ordered`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    I think we need to address that too. Sounds to me these tests aren’t stable before.
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @MLnick How about this way:
    The case "fequencyAsc/Desc", sort first by frequency and then by alphabet,
    The case "alphabetAsc/Desc", sort by alphabet (and if alphabetically equal, the two label should be the same ?)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @viirya @MLnick Code updated. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @felixcheung Another failed testcase, spark.mlp in sparkR, it also use `RFormula` and it will also generate indeterministic result, see class `MultilayerPerceptronClassifierWrapper` line 78:
    ```
    val rFormula = new RFormula()
          .setFormula(formula)
          .setForceIndexLabel(true)
          .setHandleInvalid(handleInvalid)
    ```
    It can not set the string order and the default `frequencyDesc` order will bring indeterministic result.
    
    If I only modify the testcase in `spark.mlp`, in the future if the `StringIndexer` implementation being changed, those tests will probably be broken again. What do you think of this ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #83872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83872/testReport)** for PR 19621 at commit [`b0b14b0`](https://github.com/apache/spark/commit/b0b14b0971a7b941abbadf52d03dbb7d77e93adc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84101 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84101/testReport)** for PR 19621 at commit [`031f53f`](https://github.com/apache/spark/commit/031f53fbd1c112d8f0b37bb29e847cd3184498c6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #83263 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83263/testReport)** for PR 19621 at commit [`faa8390`](https://github.com/apache/spark/commit/faa839052ebe0949882b7b8c0ca3557f26beb8e8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    maybe we could also change the test itself to make it more deterministic?
    we could first create a new test dataset that avoid having frequency values, run it through the original implementation, then run through this new change to make sure they match.
    
    @actuaryzhang do you have any thought on this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r151344393
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -217,69 +289,94 @@ class StringIndexerModel (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
    -    if (!dataset.schema.fieldNames.contains($(inputCol))) {
    -      logInfo(s"Input column ${$(inputCol)} does not exist during transformation. " +
    -        "Skip StringIndexerModel.")
    -      return dataset.toDF
    -    }
         transformSchema(dataset.schema, logging = true)
     
    -    val filteredLabels = getHandleInvalid match {
    -      case StringIndexer.KEEP_INVALID => labels :+ "__unknown"
    -      case _ => labels
    -    }
    +    var (inputColNames, outputColNames) = getInOutCols
    +
    +    val outputColumns = new Array[Column](outputColNames.length)
     
    -    val metadata = NominalAttribute.defaultAttr
    -      .withName($(outputCol)).withValues(filteredLabels).toMetadata()
    +    var filteredDataset = dataset
         // If we are skipping invalid records, filter them out.
    -    val (filteredDataset, keepInvalid) = getHandleInvalid match {
    -      case StringIndexer.SKIP_INVALID =>
    +    if (getHandleInvalid == StringIndexer.SKIP_INVALID) {
    +      filteredDataset = dataset.na.drop(inputColNames.filter(
    +        dataset.schema.fieldNames.contains(_)))
    +      for (i <- 0 until inputColNames.length) {
    +        val inputColName = inputColNames(i)
    +        val labelToIndex = labelToIndexArray(i)
             val filterer = udf { label: String =>
               labelToIndex.contains(label)
             }
    -        (dataset.na.drop(Array($(inputCol))).where(filterer(dataset($(inputCol)))), false)
    -      case _ => (dataset, getHandleInvalid == StringIndexer.KEEP_INVALID)
    +        filteredDataset = filteredDataset.where(filterer(dataset(inputColName)))
    +      }
         }
     
    -    val indexer = udf { label: String =>
    -      if (label == null) {
    -        if (keepInvalid) {
    -          labels.length
    -        } else {
    -          throw new SparkException("StringIndexer encountered NULL value. To handle or skip " +
    -            "NULLS, try setting StringIndexer.handleInvalid.")
    -        }
    +    for (i <- 0 until outputColNames.length) {
    +      val inputColName = inputColNames(i)
    +      val outputColName = outputColNames(i)
    +      val labelToIndex = labelToIndexArray(i)
    +      val labels = labelsArray(i)
    +
    +      if (!dataset.schema.fieldNames.contains(inputColName)) {
    +        logInfo(s"Input column ${inputColName} does not exist during transformation. " +
    +          "Skip this column StringIndexerModel transform.")
    +        outputColNames(i) = null
           } else {
    -        if (labelToIndex.contains(label)) {
    -          labelToIndex(label)
    -        } else if (keepInvalid) {
    -          labels.length
    -        } else {
    -          throw new SparkException(s"Unseen label: $label.  To handle unseen labels, " +
    -            s"set Param handleInvalid to ${StringIndexer.KEEP_INVALID}.")
    +        val filteredLabels = getHandleInvalid match {
    +          case StringIndexer.KEEP_INVALID => labelsArray(i) :+ "__unknown"
    +          case _ => labelsArray(i)
             }
    -      }
    -    }.asNondeterministic()
     
    -    filteredDataset.select(col("*"),
    -      indexer(dataset($(inputCol)).cast(StringType)).as($(outputCol), metadata))
    +        val metadata = NominalAttribute.defaultAttr
    +          .withName(outputColName).withValues(filteredLabels).toMetadata()
    +
    +        val keepInvalid = (getHandleInvalid == StringIndexer.KEEP_INVALID)
    +
    +        val indexer = udf { label: String =>
    +          if (label == null) {
    +            if (keepInvalid) {
    +              labels.length
    +            } else {
    +              throw new SparkException("StringIndexer encountered NULL value. To handle or skip " +
    +                "NULLS, try setting StringIndexer.handleInvalid.")
    +            }
    +          } else {
    +            if (labelToIndex.contains(label)) {
    +              labelToIndex(label)
    +            } else if (keepInvalid) {
    +              labels.length
    +            } else {
    +              throw new SparkException(s"Unseen label: $label.  To handle unseen labels, " +
    +                s"set Param handleInvalid to ${StringIndexer.KEEP_INVALID}.")
    +            }
    +          }
    +        }.asNondeterministic()
    +
    +        outputColumns(i) = indexer(dataset(inputColName).cast(StringType))
    +          .as(outputColName, metadata)
    +      }
    +    }
    +    filteredDataset.withColumns(outputColNames.filter(_ != null),
    +      outputColumns.filter(_ != null))
    --- End diff --
    
    In case `outputColNames` and `outputColNames` are empty, `withColumns` might return an empty dataset, not original dataset.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @felixcheung There is no breaking change. But, we meet some trouble thing about indeterministic behavior. When frequency equal, the indexer result is indeterministic. I already fix those in RFormula test. But, I don't know how to fix the sparkR `glm` test. Because it depends on R-glm library and I don't know how to set the indexer order for R-glm library. Do you know anyone who is familiar with this ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 closed the pull request at:

    https://github.com/apache/spark/pull/19621


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    I am too busy recently to fix those failed R tests. Anyone who has spare time can take over this PR and I will help review. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83392/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @felixcheung "iris" is a built-in dataset in R, used in many algo testing, so is it proper to change it ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r152717925
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -130,21 +160,49 @@ class StringIndexer @Since("1.4.0") (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val values = dataset.na.drop(Array($(inputCol)))
    -      .select(col($(inputCol)).cast(StringType))
    -      .rdd.map(_.getString(0))
    -    val labels = $(stringOrderType) match {
    -      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
    -      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
    +
    +    val inputCols = getInOutCols._1
    +
    +    val zeroState = Array.fill(inputCols.length)(new OpenHashMap[String, Long]())
    +
    +    val countByValueArray = dataset.na.drop(inputCols)
    +      .select(inputCols.map(col(_).cast(StringType)): _*)
    +      .rdd.aggregate(zeroState)(
    --- End diff --
    
    Is `treeAggregate` better? I think it should be faster?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84093/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83263/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r157159460
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -79,20 +80,49 @@ private[feature] trait StringIndexerBase extends Params with HasHandleInvalid wi
       @Since("2.3.0")
       def getStringOrderType: String = $(stringOrderType)
     
    +  private[feature] def getInOutCols: (Array[String], Array[String]) = {
    +
    +    require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) ||
    +      (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)),
    +      "Only allow to set either inputCol/outputCol, or inputCols/outputCols"
    +    )
    +
    +    if (isSet(inputCol)) {
    +      (Array($(inputCol)), Array($(outputCol)))
    +    } else {
    +      require($(inputCols).length == $(outputCols).length,
    --- End diff --
    
    Should add a test case for this


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r151340844
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -217,69 +289,94 @@ class StringIndexerModel (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
    -    if (!dataset.schema.fieldNames.contains($(inputCol))) {
    -      logInfo(s"Input column ${$(inputCol)} does not exist during transformation. " +
    -        "Skip StringIndexerModel.")
    -      return dataset.toDF
    -    }
         transformSchema(dataset.schema, logging = true)
    --- End diff --
    
    We can skip `StringIndexerModel` too if all input columns don't exist?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Jenkins retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r151337906
  
    --- Diff: project/MimaExcludes.scala ---
    @@ -82,7 +82,15 @@ object MimaExcludes {
     
         // [SPARK-21087] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala
         ProblemFilters.exclude[FinalClassProblem]("org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelWriter"),
    -    ProblemFilters.exclude[FinalClassProblem]("org.apache.spark.ml.tuning.TrainValidationSplitModel$TrainValidationSplitModelWriter")
    +    ProblemFilters.exclude[FinalClassProblem]("org.apache.spark.ml.tuning.TrainValidationSplitModel$TrainValidationSplitModelWriter"),
    +
    +    // [SPARK-11215][ML] Add multiple columns support to StringIndexer
    +    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema"),
    +    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexerModel.validateAndTransformSchema"),
    +    ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"),
    +    ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.outputCols"),
    +    ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.getOutputCols"),
    +    ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.org$apache$spark$ml$param$shared$HasOutputCols$_setter_$outputCols_=")
    --- End diff --
    
    Can those cause binary incompatibility issue in user application?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r152252553
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -217,69 +289,94 @@ class StringIndexerModel (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
    -    if (!dataset.schema.fieldNames.contains($(inputCol))) {
    -      logInfo(s"Input column ${$(inputCol)} does not exist during transformation. " +
    -        "Skip StringIndexerModel.")
    -      return dataset.toDF
    -    }
         transformSchema(dataset.schema, logging = true)
    --- End diff --
    
    updated.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83323/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Seems in the frequency-based string orders, the order of labels with same frequency is non-deterministic. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83872/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #84101 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84101/testReport)** for PR 19621 at commit [`031f53f`](https://github.com/apache/spark/commit/031f53fbd1c112d8f0b37bb29e847cd3184498c6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #83878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83878/testReport)** for PR 19621 at commit [`77bea32`](https://github.com/apache/spark/commit/77bea32984b167894be79736f56601a44baaaa99).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    @WeichenXu123 I will try to look into this today.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Jenkins, test this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r157158872
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -79,20 +80,49 @@ private[feature] trait StringIndexerBase extends Params with HasHandleInvalid wi
       @Since("2.3.0")
       def getStringOrderType: String = $(stringOrderType)
     
    +  private[feature] def getInOutCols: (Array[String], Array[String]) = {
    +
    +    require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) ||
    +      (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)),
    +      "Only allow to set either inputCol/outputCol, or inputCols/outputCols"
    --- End diff --
    
    Maybe match the language for the exception message [here](https://github.com/apache/spark/pull/19715/files#diff-bf4cb764860f82d632ac0730e3d8c605R173)?
    
    `StringIndexer only supports setting either inputCol/outputCol or inputCols/outputCols`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    **[Test build #83872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83872/testReport)** for PR 19621 at commit [`b0b14b0`](https://github.com/apache/spark/commit/b0b14b0971a7b941abbadf52d03dbb7d77e93adc).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19621: [SPARK-11215][ML] Add multiple columns support to...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19621#discussion_r152252491
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -130,21 +160,49 @@ class StringIndexer @Since("1.4.0") (
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val values = dataset.na.drop(Array($(inputCol)))
    -      .select(col($(inputCol)).cast(StringType))
    -      .rdd.map(_.getString(0))
    -    val labels = $(stringOrderType) match {
    -      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
    -        .map(_._1).toArray
    -      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
    -      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
    +
    +    val inputCols = getInOutCols._1
    +
    +    val zeroState = Array.fill(inputCols.length)(new OpenHashMap[String, Long]())
    +
    +    val countByValueArray = dataset.na.drop(inputCols)
    +      .select(inputCols.map(col(_).cast(StringType)): _*)
    +      .rdd.aggregate(zeroState)(
    +      (state: Array[OpenHashMap[String, Long]], row: Row) => {
    +        for (i <- 0 until inputCols.length) {
    +          state(i).changeValue(row.getString(i), 1L, _ + 1)
    +        }
    +        state
    +      },
    +      (state1: Array[OpenHashMap[String, Long]], state2: Array[OpenHashMap[String, Long]]) => {
    +        for (i <- 0 until inputCols.length) {
    +          state2(i).foreach { case (key: String, count: Long) =>
    +            state1(i).changeValue(key, count, _ + count)
    +          }
    +        }
    +        state1
    +      }
    +    )
    +    val labelsArray = countByValueArray.map { countByValue =>
    +      $(stringOrderType) match {
    +        case StringIndexer.frequencyDesc => countByValue.toSeq.sortBy(-_._2).map(_._1).toArray
    +        case StringIndexer.frequencyAsc => countByValue.toSeq.sortBy(_._2).map(_._1).toArray
    +        case StringIndexer.alphabetDesc => countByValue.toSeq.map(_._1).sortWith(_ > _).toArray
    +        case StringIndexer.alphabetAsc => countByValue.toSeq.map(_._1).sortWith(_ < _).toArray
    --- End diff --
    
    Yes, but will aggregate count bring apparent overhead ? I don't want the code including too many `if ..else`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83396/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    stringindexer is set automatically for index column. are we having breaking API change here?
    https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L216



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    I think I understand what you are saying but the latest test failure I see it from spark.mlp instead and be results are different from the existing ones.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    The first case you mention wouldn’t actually end up sorting by freq, no? It
    would have to be the other way around?
    
    For second case, yes equality must mean it is the same string / key so
    shouldn’t actually occur
    On Wed, 22 Nov 2017 at 16:01, WeichenXu <no...@github.com> wrote:
    
    > @MLnick <https://github.com/mlnick> How about this way:
    > The case "fequencyAsc/Desc", sort first by frequency and then by alphabet,
    > The case "alphabetAsc/Desc", sort by alphabet (and if alphabetically
    > equal, the two label should be the same ?)
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/19621#issuecomment-346358015>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/AA_SB78zMKYwfyjaNMedcxN9GIFdJclNks5s5ClBgaJpZM4QM3Ni>
    > .
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19621
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org