You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by actuaryzhang <gi...@git.apache.org> on 2017/05/06 06:59:37 UTC

[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/17879

    [SPARK-20619][ML]StringIndexer supports multiple ways of label ordering

    ## What changes were proposed in this pull request?
    
    StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL, for example, this will affect the result in one-hot encoding. The PR proposes to support other ordering method including alphabetic order and ascending order of label frequency. We add a parameter `stringOrderType` to control how string is ordered which supports four options:
    - 'freq_desc': descending order by label frequency (most frequent label assigned 0)
    - 'freq_asc': ascending order by label frequency (least frequent label assigned 0)
    - 'alphabet_desc': descending alphabetical order
    - 'alphabet_asc': ascending alphabetical order
    
    The default is still descending order of label frequency, so there should be no impact to existing programs.
    
    ## How was this patch tested?
    new test 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark stringIndexer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17879.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17879
    
----
commit ffd0cfc755586402c0f22e4458149eed16f98010
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-06T06:54:41Z

    StringIndexer supports multiple ways of label ordering

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76521 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76521/testReport)** for PR 17879 at commit [`97e020f`](https://github.com/apache/spark/commit/97e020f4aba1afcf45c1b10de09d6fbba1551918).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115413770
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -131,6 +167,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
       private[feature] val KEEP_INVALID: String = "keep"
       private[feature] val supportedHandleInvalids: Array[String] =
         Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
    +  private[feature] val FREQ_DESC: String = "frequency_desc"
    +  private[feature] val FREQ_ASC: String = "frequency_asc"
    +  private[feature] val ALPHABET_DESC: String = "alphabet_desc"
    +  private[feature] val ALPHABET_ASC: String = "alphabet_asc"
    --- End diff --
    
    @gatorsmile Thanks much for the suggestion. Changed them to lowerCamelCase. 
    @felixcheung Any additional suggestions? 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115124754
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -59,6 +59,28 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
       @Since("1.6.0")
       def getHandleInvalid: String = $(handleInvalid)
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'freq_desc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'freq_asc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabet_desc': descending alphabetical order
    +   *   - 'alphabet_asc': ascending alphabetical order
    +   * Default is 'freq_desc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "The method used to order values of input column. " +
    +      s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    +    (value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase))
    +
    +  /** @group getParam */
    +  @Since("2.2.0")
    +  def getStringOrderType: String = $(stringOrderType)
    --- End diff --
    
    @viirya Which ML classes were you referring to? I was told not to change the raw values in the getters in other PRs  #16675. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76521/testReport)** for PR 17879 at commit [`97e020f`](https://github.com/apache/spark/commit/97e020f4aba1afcf45c1b10de09d6fbba1551918).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115403961
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -131,6 +163,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
       private[feature] val KEEP_INVALID: String = "keep"
       private[feature] val supportedHandleInvalids: Array[String] =
         Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
    +  private[feature] val FREQ_DESC: String = "freq_desc"
    --- End diff --
    
    @gatorsmile thought?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76517/testReport)** for PR 17879 at commit [`ffd0cfc`](https://github.com/apache/spark/commit/ffd0cfc755586402c0f22e4458149eed16f98010).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76521/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76621/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115116222
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -107,11 +135,15 @@ class StringIndexer @Since("1.4.0") (
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val counts = dataset.na.drop(Array($(inputCol))).select(col($(inputCol)).cast(StringType))
    -      .rdd
    -      .map(_.getString(0))
    -      .countByValue()
    -    val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
    +    val values = dataset.na.drop(Array($(inputCol)))
    +      .select(col($(inputCol)).cast(StringType))
    +      .rdd.map(_.getString(0))
    +    val labels = $(stringOrderType) match {
    +      case "freq_desc" => values.countByValue().toSeq.sortBy(-_._2).map(_._1).toArray
    --- End diff --
    
    @viirya Great catch. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @actuaryzhang There seems something wrong with Github's webpage, so I can't directly reply the above comment. `ALSModelParams.getColdStartStrategy` is one example.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Thanks much!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115414436
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -59,6 +59,29 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
       @Since("1.6.0")
       def getHandleInvalid: String = $(handleInvalid)
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabetDesc': descending alphabetical order
    +   *   - 'alphabetAsc': ascending alphabetical order
    +   * Default is 'frequencyDesc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "how to order labels of string column. " +
    +    "The first label after ordering is assigned an index of 0. " +
    +    s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    +    ParamValidators.inArray(StringIndexer.supportedStringOrderType))
    --- End diff --
    
    @felixcheung  Right. It does not quite make sense to be case-sensitive now given that we now use camel case. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @viirya Thanks much for your comments. Made a new commit to address them. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115414165
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -59,6 +59,29 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
       @Since("1.6.0")
       def getHandleInvalid: String = $(handleInvalid)
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabetDesc': descending alphabetical order
    +   *   - 'alphabetAsc': ascending alphabetical order
    +   * Default is 'frequencyDesc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "how to order labels of string column. " +
    +    "The first label after ordering is assigned an index of 0. " +
    +    s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    +    ParamValidators.inArray(StringIndexer.supportedStringOrderType))
    --- End diff --
    
    so we are going to case sensitive then?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115402478
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -131,6 +163,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
       private[feature] val KEEP_INVALID: String = "keep"
       private[feature] val supportedHandleInvalids: Array[String] =
         Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
    +  private[feature] val FREQ_DESC: String = "freq_desc"
    --- End diff --
    
    @felixcheung I did not find any prior standard, and am open to suggestion for better names. 
    Maybe better use `frequency_desc` or `count_desc`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76621/testReport)** for PR 17879 at commit [`ff9b1d6`](https://github.com/apache/spark/commit/ff9b1d66873eb8cad1a4a13f323555da2706a849).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @felixcheung Thanks. I will update the annotation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76517/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17879


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115401566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -131,6 +163,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
       private[feature] val KEEP_INVALID: String = "keep"
       private[feature] val supportedHandleInvalids: Array[String] =
         Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
    +  private[feature] val FREQ_DESC: String = "freq_desc"
    --- End diff --
    
    is there any prior standard for these names like `freq_desc`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115116212
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -59,6 +59,28 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
       @Since("1.6.0")
       def getHandleInvalid: String = $(handleInvalid)
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'freq_desc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'freq_asc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabet_desc': descending alphabetical order
    +   *   - 'alphabet_asc': ascending alphabetical order
    +   * Default is 'freq_desc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "The method used to order values of input column. " +
    +      s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    +    (value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase))
    --- End diff --
    
    @viirya `ParamValidators.inArray` does not allow case-insensitive validation, does it? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76628/testReport)** for PR 17879 at commit [`53381ea`](https://github.com/apache/spark/commit/53381ea6ba41cc26ed89a6fc42252f7126198d9f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115115666
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -59,6 +59,28 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
       @Since("1.6.0")
       def getHandleInvalid: String = $(handleInvalid)
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'freq_desc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'freq_asc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabet_desc': descending alphabetical order
    +   *   - 'alphabet_asc': ascending alphabetical order
    +   * Default is 'freq_desc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "The method used to order values of input column. " +
    +      s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    +    (value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase))
    --- End diff --
    
    Use `ParamValidators.inArray`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76619/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Thanks much @felixcheung and @viirya. I have addressed your comments. 
    - update from 2.2 to 2.3
    - change `freq_desc` to `frequency_desc`. 
    - move toLowerCase to the getter method. 
    Please let me know if there is anything needed. Thanks!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115409190
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -131,6 +167,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
       private[feature] val KEEP_INVALID: String = "keep"
       private[feature] val supportedHandleInvalids: Array[String] =
         Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
    +  private[feature] val FREQ_DESC: String = "frequency_desc"
    +  private[feature] val FREQ_ASC: String = "frequency_asc"
    +  private[feature] val ALPHABET_DESC: String = "alphabet_desc"
    +  private[feature] val ALPHABET_ASC: String = "alphabet_asc"
    --- End diff --
    
    Normally, we do not use underscore in the names. `lowerCamelCase` is our rules for naming. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @felixcheung @gatorsmile @MLnick @jkbradley @holdenk @yanboliang @srowen @sethah 
    Would you please take another look and let me know any additional suggestions? Thanks much!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115116419
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -59,6 +59,28 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
       @Since("1.6.0")
       def getHandleInvalid: String = $(handleInvalid)
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'freq_desc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'freq_asc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabet_desc': descending alphabetical order
    +   *   - 'alphabet_asc': ascending alphabetical order
    +   * Default is 'freq_desc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "The method used to order values of input column. " +
    +      s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    +    (value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase))
    --- End diff --
    
    Yeah, I originally thought you'd change to case-sensitive. It looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76628/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76624/testReport)** for PR 17879 at commit [`07198d9`](https://github.com/apache/spark/commit/07198d9bb45a54d3c257ad37e772cc31154ffcb6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @felixcheung 
    It would be great if you could help merge this. I could address comments (if any) in a future PR.
    This seems a pretty straightforward change that removes a big blocker. I will continue working on the RFormula side once this merges in and fix the SparkR issues related to string ordering. 
    
    Thanks much! 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @holdenk The main motivation for this PR is that the behavior of StringIndexer will affect OneHotEncoder, RFormula and models estimated based on these transformers. There have been a few desired improvement in RFormula that could not be done without the change in StringIndexer.
    
    One use case for alphabetical ordering is to make comparison of Spark model results to that in R, which drops the first alphabetical value in one-hot encoding. Right now, even though we do lots of comparisons between Spark and R, we lack comparisons involving String features because the encoding is different. There is already a [JIRA|https://issues.apache.org/jira/browse/SPARK-14659]. 
    
    Another motivation for this PR is to support ascending order by label frequency. This is also related to one-hot encoding. In practical applications of regression type models, it is almost always better to set the most frequent label as the reference level (i.e., drop the most frequent label in OneHotEncoding) for better interpretability. Right now, the behavior is the opposite and has made it very difficult to interpret results. 
    
    I think  the flexibility of different ordering will benefit a lot the downstream feature transformers and model estimators. Does this make sense? 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    What would be some common cases where alphabet ordering would be needed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @jkbradley @MLnick @holdenk @pnpritchard @yanboliang @sethah @imatiach-msft @srowen 
    
    An example to illustrate the idea:
    ```
    val data = Seq((0, "b"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "b"))
    val df = data.toDF("id", "label")
    val indexer = new StringIndexer()
          .setInputCol("label")
          .setOutputCol("labelIndex")
    df.show
    +---+-----+
    | id|label|
    +---+-----+
    |  0|    b|
    |  1|    b|
    |  2|    c|
    |  3|    a|
    |  4|    a|
    |  5|    b|
    +---+-----+
    ```
    
    Below is the result corresponding to the different types of label ordering. 
    ```
    indexer.setStringOrderType("freq_desc").fit(df).transform(df)
    +---+-----+----------+
    | id|label|labelIndex|
    +---+-----+----------+
    |  0|    b|       0.0|
    |  1|    b|       0.0|
    |  2|    c|       2.0|
    |  3|    a|       1.0|
    |  4|    a|       1.0|
    |  5|    b|       0.0|
    +---+-----+----------+
    
    indexer.setStringOrderType("freq_asc").fit(df).transform(df)
    +---+-----+----------+
    | id|label|labelIndex|
    +---+-----+----------+
    |  0|    b|       2.0|
    |  1|    b|       2.0|
    |  2|    c|       0.0|
    |  3|    a|       1.0|
    |  4|    a|       1.0|
    |  5|    b|       2.0|
    +---+-----+----------+
    
    indexer.setStringOrderType("alphabet_desc").fit(df).transform(df)
    +---+-----+----------+
    | id|label|labelIndex|
    +---+-----+----------+
    |  0|    b|       1.0|
    |  1|    b|       1.0|
    |  2|    c|       0.0|
    |  3|    a|       2.0|
    |  4|    a|       2.0|
    |  5|    b|       1.0|
    +---+-----+----------+
    
    indexer.setStringOrderType("alphabet_asc").fit(df).transform(df)
    +---+-----+----------+
    | id|label|labelIndex|
    +---+-----+----------+
    |  0|    b|       1.0|
    |  1|    b|       1.0|
    |  2|    c|       2.0|
    |  3|    a|       0.0|
    |  4|    a|       0.0|
    |  5|    b|       1.0|
    +---+-----+----------+
    
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @felixcheung Thanks much for your review. 
    @yanboliang @jkbradley Since there are two approvals, could you guys take a look and merge if it's good? We really need this for a couple of SparkR related issues, e.g., SPARK-14659 and SPARK-14657. Thanks much!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @yanboliang Since you have reported a few issues due to different encoding between Spark and R (e.g., #SPARK-14659 and #SPARK-14657), probably you could add some comments? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    merged to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76624/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76621/testReport)** for PR 17879 at commit [`ff9b1d6`](https://github.com/apache/spark/commit/ff9b1d66873eb8cad1a4a13f323555da2706a849).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115116845
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -59,6 +59,28 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
       @Since("1.6.0")
       def getHandleInvalid: String = $(handleInvalid)
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'freq_desc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'freq_asc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabet_desc': descending alphabetical order
    +   *   - 'alphabet_asc': ascending alphabetical order
    +   * Default is 'freq_desc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "The method used to order values of input column. " +
    +      s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    +    (value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase))
    +
    +  /** @group getParam */
    +  @Since("2.2.0")
    +  def getStringOrderType: String = $(stringOrderType)
    --- End diff --
    
    I checked other ML classes. Looks like for a case-insensitive setting, we may do toLowerCase in its public API:
    
        def getStringOrderType: String = $(stringOrderType).toLowerCase
    
    And you can use `getStringOrderType` below instead of `$(stringOrderType).toLowerCase` in `fit`.
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17879: [SPARK-20619][ML]StringIndexer supports multiple ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17879#discussion_r115115725
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
    @@ -107,11 +135,15 @@ class StringIndexer @Since("1.4.0") (
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): StringIndexerModel = {
         transformSchema(dataset.schema, logging = true)
    -    val counts = dataset.na.drop(Array($(inputCol))).select(col($(inputCol)).cast(StringType))
    -      .rdd
    -      .map(_.getString(0))
    -      .countByValue()
    -    val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
    +    val values = dataset.na.drop(Array($(inputCol)))
    +      .select(col($(inputCol)).cast(StringType))
    +      .rdd.map(_.getString(0))
    +    val labels = $(stringOrderType) match {
    +      case "freq_desc" => values.countByValue().toSeq.sortBy(-_._2).map(_._1).toArray
    --- End diff --
    
    Seems this setting is case-insensitive because your param validator doesn't care it, `$(stringOrderType)` might be upper-case here. We may make it case-sensitive or do `toLowerCase` here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76619 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76619/testReport)** for PR 17879 at commit [`ba34043`](https://github.com/apache/spark/commit/ba340437fee99f848dfa88eab2e10d87651eab0a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76628/testReport)** for PR 17879 at commit [`53381ea`](https://github.com/apache/spark/commit/53381ea6ba41cc26ed89a6fc42252f7126198d9f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    ping @yanboliang @felixcheung 
    This is needed for one-hot encoding to be consistent with R, therefore enabling directly comparison of Spark results to R. Could you guys please take a look? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76517/testReport)** for PR 17879 at commit [`ffd0cfc`](https://github.com/apache/spark/commit/ffd0cfc755586402c0f22e4458149eed16f98010).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76619 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76619/testReport)** for PR 17879 at commit [`ba34043`](https://github.com/apache/spark/commit/ba340437fee99f848dfa88eab2e10d87651eab0a).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    **[Test build #76624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76624/testReport)** for PR 17879 at commit [`07198d9`](https://github.com/apache/spark/commit/07198d9bb45a54d3c257ad37e772cc31154ffcb6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org