You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by actuaryzhang <gi...@git.apache.org> on 2017/05/12 17:56:24 UTC

[GitHub] spark pull request #17967: [SPARK-14659] RFormula allows to drop the same ca...

GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/17967

    [SPARK-14659] RFormula allows to drop the same category as R when handling strings

    ## What changes were proposed in this pull request?
    When handling strings, the category dropped by RFormula and R are different:
    - RFormula drops the least frequent level 
    - R drops the last level after ascending alphabetical ordering 
    
    This PR supports different string ordering type in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings when`stringOrderType = "alphabetAsc"`. 
    
    ## How was this patch tested?
    new tests 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark RFormula

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17967.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17967
    
----
commit 4d27123926ee87231a73aea9dc34555c404c7f1b
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-12T07:52:13Z

    add stringOrderType to RFormula

commit 6841c33768adf1b1397dc5aa36e34abdb8d6ff8a
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-12T16:30:12Z

    clean up import

commit 77fe864770420719d396715479fc1f452a80b8da
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-12T17:48:44Z

    add comparison to R

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77110 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77110/testReport)** for PR 17967 at commit [`5f31d31`](https://github.com/apache/spark/commit/5f31d311c0c39da1968686dd4147376b3888cee3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula allows to drop the same categ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76878/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17967#discussion_r117873500

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
*/
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

+ /**
+ * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
+ * The last category after ordering is dropped when encoding strings.
+ * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
+ * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
+ * drops the same category as R when encoding strings.
+ *
+ * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
+ * {{{
+ * +-----------------+---------------------------------------+----------------------------------+
--- End diff --

@HyukjinKwon Would you please clarify what you mean by a `list`? Thanks.
I would like to preserve the table structure because it helps show the difference.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117909338
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    @felixcheung @HyukjinKwon The scaladoc complied, but the javadoc failed...  Not sure if there is additional config for java? 
    
    ![image](https://cloud.githubusercontent.com/assets/11082368/26341144/048b8d6e-3f47-11e7-8600-c111643a0295.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117917444
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    I think javadoc 8 complains about HTML. It looks this works:
    
    ```html
       * <table summary="abc">
       * <tr>
       *   <th>Firstname</th>
       *   <th>Lastname</th>
       *   <th>Age</th>
       * </tr>
       * <tr>
       *   <td>Jill</td>
       *   <td>Smith</td>
       *   <td>50</td>
       * </tr>
       * <tr>
       *   <td>Eve</td>
       *   <td>Jackson</td>
       *   <td>94</td>
       *  </tr>
       * </table>
    ```
    
    Scaladoc
    
    ![2017-05-23 4 28 14](https://cloud.githubusercontent.com/assets/6477701/26342961/19a85edc-3fd5-11e7-8454-5ebaa29bbafd.png)
    
    
    
    Javadoc
    
    ![2017-05-23 4 27 55](https://cloud.githubusercontent.com/assets/6477701/26342974/23b6ef92-3fd5-11e7-993f-a42c19f5da89.png)
    
    Other errors probably are spurious (please refer https://issues.apache.org/jira/browse/SPARK-20840 which I am fighting with right now).
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77116/testReport)** for PR 17967 at commit [`341949c`](https://github.com/apache/spark/commit/341949c4c1e09baa9478e54e06aa1133b3c6fc86).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @felixcheung @yanboliang I'm fine with either the ascii table or the html table. It's your call. 
    Hope to get over this minor doc issue and get this PR in soon. I can update the doc later if we find a better way. Thanks much. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117931896
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    Not sure. it did not work for me too ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r116389971
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,29 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabetDesc': descending alphabetical order
    +   *   - 'alphabetAsc': ascending alphabetical order
    +   * Default is 'frequencyDesc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "how to order labels of string column. " +
    --- End diff --
    
    Fixed it, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    This is what we get from the current doc:
    
    ![image](https://cloud.githubusercontent.com/assets/11082368/26430799/dd49fa4c-40a4-11e7-95c6-66def9a8f588.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @viirya Great point. Added a comment to explain this in the doc.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117754423
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    --- End diff --
    
    The order should be ```alphabetAsc``` to match R.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117523754
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,31 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    --- End diff --
    
    Please mentions of ```label/labels``` in all comments, since we handle all string columns regardless of feature columns or label column. The current description may confused users. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117906723
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    it's suppose to work with raw html tag? I'm not sure why `<ul>` works but `<table>` doesn't...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #76913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76913/testReport)** for PR 17967 at commit [`698588e`](https://github.com/apache/spark/commit/698588e15b0407e987dad77fb060f0404c8276a9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula allows to drop the same categ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #76878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76878/testReport)** for PR 17967 at commit [`a1be94c`](https://github.com/apache/spark/commit/a1be94cf92649ec553da3b47fd481f5a1ac37079).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77110/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula allows to drop the same categ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17967

@yanboliang I understand your points. The issue is `OneHotEncoder` only supports `dropLast`.
The ideal solution to match R exactly (both the category dropped and ordering of feature columns) will be use `alphabetAsc` in StringIndexer and `dropFirst` in OneHotEncoder.

Without changing `OneHotEncoder`, the best I can do in this PR is to match only the category that is dropped in R. This will make sure the model interpretation and magnitude of coefficients are consistent with R, but the ordering among the feature columns are still different, which is a minor issue. That's also why I sorted the coefficients first in the example above to compare GLM results.

Please let me know if this is clear and your thought on `OneHotEncoder`. If adding a `dropFirst` is preferred, I can also update `OneHotEncoder`. But that may cause some disruption. Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659] RFormula allows to drop the same category ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #76877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76877/testReport)** for PR 17967 at commit [`77fe864`](https://github.com/apache/spark/commit/77fe864770420719d396715479fc1f452a80b8da).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    I tried using `<table cellspacing="4">`, but in Scaladoc, it is not correctly formatted. I tried a few other options, but it seems the html attributes are ignored in Scaladoc. 
    
    ![image](https://cloud.githubusercontent.com/assets/11082368/26425130/9aa59bcc-4088-11e7-9740-2e6190c8dee1.png)
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77130/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r116387686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,29 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabetDesc': descending alphabetical order
    +   *   - 'alphabetAsc': ascending alphabetical order
    +   * Default is 'frequencyDesc'.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    +    "how to order labels of string column. " +
    --- End diff --
    
    should it be capitalize `"How`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659] RFormula allows to drop the same category ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76877/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117602143
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -38,29 +38,35 @@ import org.apache.spark.sql.types._
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
       /**
    -   * Param for how to order labels of string column. The first label after ordering is assigned
    -   * an index of 0.
    -   * Options are:
    -   *   - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
    -   *   - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
    -   *   - 'alphabetDesc': descending alphabetical order
    -   *   - 'alphabetAsc': ascending alphabetical order
    -   * Default is 'frequencyDesc'.
    -   * When the ordering is set to 'alphabetDesc', `RFormula` drops the same category as R
    -   * when encoding strings.
    +   * Param for how to order categories of a FEATURE string column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * The options are explained using an example string: 'b', 'a', 'b', 'a', 'c', 'b'
    +   * |
    +   * | Option | Category mapped to 0 by StringIndexer |  Category dropped by RFormula
    --- End diff --
    
    Hm, up to my knowledge, it looks not. I guess @actuaryzhang meant to just write these out as they are? Let me double check by myself ....
    
    Scaladoc
    
    <img width="952" alt="2017-05-20 1 40 35" src="https://cloud.githubusercontent.com/assets/6477701/26273032/0d97fd84-3d62-11e7-8f18-1c89f539b1ae.png">
    
    
    Javadoc
    
    <img width="1086" alt="2017-05-20 1 40 53" src="https://cloud.githubusercontent.com/assets/6477701/26273031/0bac57cc-3d62-11e7-8875-6f897b093633.png">
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117892629
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    @HyukjinKwon Thanks for the clarification. I don't think `list` paints a clear picture here. Would rather keep the table structure. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged into master, thanks for all.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @HyukjinKwon @felixcheung I confirm it works for Javadoc. 
    ![image](https://cloud.githubusercontent.com/assets/11082368/26277962/21dbe70e-3d46-11e7-978f-e422b9122e87.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @felixcheung Is the html tag `<table>` supported? Tried this but failed to compile... 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117753728
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    +   * |      Option     | Category mapped to 0 by StringIndexer |  Category dropped by RFormula    |
    +   * +-----------------+---------------------------------------+----------------------------------+
    +   * | 'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')    |
    +   * | 'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')     |
    +   * | 'alphabetDesc'  | first alphabetical category ('a')     | last alphabetical category ('c') |
    +   * | 'alphabetAsc'   | last alphabetical category ('c')      | first alphabetical category ('a')|
    --- End diff --
    
    I think ```ascending``` is going up i.e. A-Z, ```descending``` is going down i.e. Z-A. So in this example, first alphabetical category is "c" if order by ```alphabetDesc```, first alphabetical category is "a" if order by ```alphabetAsc```. Please correct me if I have misunderstand. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @yanboliang @MLnick @HyukjinKwon @jkbradley @sethah 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117878731
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    I guess I am not supposed to make a decision call though. Please let me know @felixcheung and @yanboliang if you have any preference.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77085/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77130/testReport)** for PR 17967 at commit [`24818a7`](https://github.com/apache/spark/commit/24818a7b77676665f9e58a88f8cc59073e368062).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77209/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117524713
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,31 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    +   * an index of 0.
    +   * Options are:
    +   *   - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
    +   *   - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
    +   *   - 'alphabetDesc': descending alphabetical order
    +   *   - 'alphabetAsc': ascending alphabetical order
    +   * Default is 'frequencyDesc'.
    +   * When the ordering is set to 'alphabetDesc', `RFormula` drops the same category as R
    +   * when encoding strings.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    --- End diff --
    
    What about ```orderTypeOfStringIndexer``` or any other suggestion? I think the param name ```stringOrderType``` in ```StringIndexer```   is clear enough, but ```RFormula``` involves lots of feature transformers across many steps, we should make users understand this param only takes effect on ```StringIndexer``` stage. And it's better to document more clear rather than copying corresponding section from ```StringIndexer```. cc @felixcheung 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117925930
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    @HyukjinKwon Nice. Thanks much. One issue is in the scaladoc, the columns are very close to each other. How to add spacing between columns in the scaladoc? 
    I tried `<table cellspacing="4">` but it does not seem to take any effect. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77209/testReport)** for PR 17967 at commit [`1a1e06c`](https://github.com/apache/spark/commit/1a1e06c9f1690e0654f78313f674c07da2b6b6f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77116/testReport)** for PR 17967 at commit [`341949c`](https://github.com/apache/spark/commit/341949c4c1e09baa9478e54e06aa1133b3c6fc86).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17967


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    I think a html table is better? https://github.com/apache/spark/pull/17967#discussion_r117917444
    + @srowen for your opinion- to be honest I don't think I've actually seen a table in Spark scaladoc/javadoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    thanks for the example, I think that's very concrete that this change would be very useful


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77085 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77085/testReport)** for PR 17967 at commit [`147311b`](https://github.com/apache/spark/commit/147311ba34db55f6aa6ffc3cf75f0c80c8c29cbf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @yanboliang I updated the example in the param doc. I hope it is clear now that it is `alphabetDesc` that drops the same category as R. That is, RFormula with `alphabetDesc` drops the first alphabetic category in string encoding. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @yanboliang Thanks for the review and suggestion. Makes lots of sense. I made a new commit to address these. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117409633
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,29 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order labels of string column. The first label after ordering is assigned
    --- End diff --
    
    Should we add a comment explaining which option is consistent with R?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117872391
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    I would like to suggest just to write out as prose with a simple list if we are all fine for now, which I guess we would generally agree with.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659] RFormula allows to drop the same category ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #76877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76877/testReport)** for PR 17967 at commit [`77fe864`](https://github.com/apache/spark/commit/77fe864770420719d396715479fc1f452a80b8da).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @felixcheung @HyukjinKwon Thanks much for pointing out the documentation issues. 
    I still prefer to have a table to clearly illustrate what each option is doing. 
    Made a new commit to make this work. Now the doc looks like: 
    
    ![image](https://cloud.githubusercontent.com/assets/11082368/26273942/4fc993be-3cf1-11e7-9c01-709b28f6833a.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17967#discussion_r117602233

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -38,29 +38,35 @@ import org.apache.spark.sql.types._
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

/**
- * Param for how to order labels of string column. The first label after ordering is assigned
- * an index of 0.
- * Options are:
- * - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- * - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- * - 'alphabetDesc': descending alphabetical order
- * - 'alphabetAsc': ascending alphabetical order
- * Default is 'frequencyDesc'.
- * When the ordering is set to 'alphabetDesc', `RFormula` drops the same category as R
- * when encoding strings.
+ * Param for how to order categories of a FEATURE string column used by `StringIndexer`.
+ * The last category after ordering is dropped when encoding strings.
+ * The options are explained using an example string: 'b', 'a', 'b', 'a', 'c', 'b'
+ * |
+ * | Option | Category mapped to 0 by StringIndexer | Category dropped by RFormula
--- End diff --

BTW, it looks hidden in the API documentation. My suggestion is prose with simple `-` or resembling any other formats in this package if there are similar instances.

Probably, it would be fine if the current format as is looks okay to you @felixcheung.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117775336
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    +   * |      Option     | Category mapped to 0 by StringIndexer |  Category dropped by RFormula    |
    +   * +-----------------+---------------------------------------+----------------------------------+
    +   * | 'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')    |
    +   * | 'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')     |
    +   * | 'alphabetDesc'  | first alphabetical category ('a')     | last alphabetical category ('c') |
    +   * | 'alphabetAsc'   | last alphabetical category ('c')      | first alphabetical category ('a')|
    --- End diff --
    
    Yes, you are right. I'll fix this example. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    yes I'd hold this for a day.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77085 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77085/testReport)** for PR 17967 at commit [`147311b`](https://github.com/apache/spark/commit/147311ba34db55f6aa6ffc3cf75f0c80c8c29cbf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77209/testReport)** for PR 17967 at commit [`1a1e06c`](https://github.com/apache/spark/commit/1a1e06c9f1690e0654f78313f674c07da2b6b6f2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117752818
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    --- End diff --
    
    Is this correct? Do you have some references? AFAIK, R formula drop the first category alphabetically ascending order when encode string/category feature (which is consistent with your PR description). I think ```test("StringIndexer order types")``` in #17879 is correct. Could you double check this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    given that I think I'm ok with an ascii table as a one time thing.
    thoughts?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659] RFormula allows to drop the same category ...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @felixcheung @viirya @ericl @yinxusen
    ```
    val original = Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5)).toDF("id", "a", "b")
    original.show
    +---+---+---+
    | id|  a|  b|
    +---+---+---+
    |  1|foo|  4|
    |  2|bar|  4|
    |  3|bar|  5|
    |  4|aaz|  5|
    +---+---+---+
    
    // Try different string order types
    val formula = new RFormula().setFormula("id ~ a + b")
    formula.setStringOrderType("frequencyDsc").fit(original).transform(original)
    +---+---+---+-------------+-----+
    | id|  a|  b|     features|label|
    +---+---+---+-------------+-----+
    |  1|foo|  4|[0.0,0.0,4.0]|  1.0|
    |  2|bar|  4|[1.0,0.0,4.0]|  2.0|
    |  3|bar|  5|[1.0,0.0,5.0]|  3.0|
    |  4|aaz|  5|[0.0,1.0,5.0]|  4.0|
    +---+---+---+-------------+-----+
    
    formula.setStringOrderType("frequencyAsc").fit(original).transform(original)
    +---+---+---+-------------+-----+
    | id|  a|  b|     features|label|
    +---+---+---+-------------+-----+
    |  1|foo|  4|[0.0,1.0,4.0]|  1.0|
    |  2|bar|  4|[0.0,0.0,4.0]|  2.0|
    |  3|bar|  5|[0.0,0.0,5.0]|  3.0|
    |  4|aaz|  5|[1.0,0.0,5.0]|  4.0|
    +---+---+---+-------------+-----+
    
    formula.setStringOrderType("alphabetDsc").fit(original).transform(original)
    +---+---+---+-------------+-----+
    | id|  a|  b|     features|label|
    +---+---+---+-------------+-----+
    |  1|foo|  4|[1.0,0.0,4.0]|  1.0|
    |  2|bar|  4|[0.0,1.0,4.0]|  2.0|
    |  3|bar|  5|[0.0,1.0,5.0]|  3.0|
    |  4|aaz|  5|[0.0,0.0,5.0]|  4.0|
    +---+---+---+-------------+-----+
    
    formula.setStringOrderType("alphabetAsc").fit(original).transform(original)
    +---+---+---+-------------+-----+
    | id|  a|  b|     features|label|
    +---+---+---+-------------+-----+
    |  1|foo|  4|[0.0,0.0,4.0]|  1.0|
    |  2|bar|  4|[0.0,1.0,4.0]|  2.0|
    |  3|bar|  5|[0.0,1.0,5.0]|  3.0|
    |  4|aaz|  5|[1.0,0.0,5.0]|  4.0|
    +---+---+---+-------------+-----+
    ```
    Note the last one with `stringOrderType = "alphabetAsc")` drops the same category as R  (i.e., "foo" is dropped and treated as the reference level). The following is from R
    ```
      df <- list(list(1, "foo", 4), list(2, "bar", 4), list(3, "bar", 5), list(4, "aaz", 5))
      df <- do.call(rbind, lapply(df, as.data.frame, col.names = c("id", "a", "b")))
      model.matrix(id ~ a + b, df)[, -1]
         abar aaaz b
          0    0   4
          1    0   4
          1    0   5
          0    1   5
    ``
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @actuaryzhang Thanks for your clarification, it makes sense. This looks good to me. @HyukjinKwon @felixcheung What do you think of the documentation issue?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76913/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula allows to drop the same categ...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Questions I have:
    1. StringIndexer and RFormula use repetitive code for defining `stringOrderType`. Should this be moved to a trait? 
    2. When setting `stringOrderType = "alphabetAsc"`, RFormula will drop the same category as R. This makes sure Spark will produce the same model as base R. However, the ordering of the columns are still different. In the above example, R columns correspond to `("bar", "aaz")`, while RFormula columns correspond to `("aaz", "bar")`. Are we OK with this? It seems that to make 100% alignment with base R, we need `stringOrderType = "alphabetDsc"` in StringIndexer, and an option for `dropFirst` in OneHotEncoder. IMO, as long as RFormula and R drop the same category, we are fine. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77130/testReport)** for PR 17967 at commit [`24818a7`](https://github.com/apache/spark/commit/24818a7b77676665f9e58a88f8cc59073e368062).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117877363
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    Ah, sure, I initially meant a HTML list that we are already using - https://github.com/apache/spark/blob/04901dd03a3f8062fd39ea38d585935ff71a9248/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L304-L340 ... 
    
    ```html
    <ul>
      <li> abc </li>
      <li> abc </li>
    </ul>
    ```
    
    I just tested it to double-check a wiki-style list ( `-` ) - http://subnormalnumbers.blogspot.kr/2011/08/scaladoc-wiki-syntax.html. This does not work correctly as below (but please go ahead if you know any compatible way for both Scaladoc and Javadoc):
    
    ```
       *  1. item one
       *
       *  1. item two
       *    - sublist
       *    - next item
       *
       *  1. now for broken sub-numbered list, the leading item must be one of
       *     `-`, `1.`, `I.`, `i.`, `A.`, or `a.`. And it must be followed by a space.
       *    1. one
       *    2. two
       *    3. three
       *
       *  1. list types
       *    I. one
       *      i. one
       *      i. two
       *    I. two
       *      A. one
       *      A. two
       *    I. three
       *      a. one
       *      a. two
    ```
    
    Scaladoc
    
    ![2017-05-23 9 52 51](https://cloud.githubusercontent.com/assets/6477701/26334123/b153378a-3f9d-11e7-8852-31b519ec9f21.png)
    
    Javadoc
    
    ![2017-05-23 9 53 07](https://cloud.githubusercontent.com/assets/6477701/26334121/af2696d2-3f9d-11e7-9924-f0e4373b2cce.png)
    
    
    
    My worry is, it draws attention with a different format. I believe we have similar instances but wonder if it is worth changing only this one. I would not strongly against but `{{{ ... }}}` basically means codes. If we can't find a better way to render this, I would leave this out as prose with a list.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77116/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117906884
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
      */
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
    +  /**
    +   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
    +   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
    +   * drops the same category as R when encoding strings.
    +   *
    +   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
    +   * {{{
    +   * +-----------------+---------------------------------------+----------------------------------+
    --- End diff --
    
    according to this, table is https://wiki.scala-lang.org/display/SW/Syntax



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @felixcheung Once this PR gets in, I'll update the SparkR side and include some test. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #76913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76913/testReport)** for PR 17967 at commit [`698588e`](https://github.com/apache/spark/commit/698588e15b0407e987dad77fb060f0404c8276a9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659] RFormula allows to drop the same category ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    Personally, I would prefer a HTML list or table one. But I am fine with the current status if this is okay to all of you here (as I guess none of them is particularly better given all the comments above).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    hmm, should we just use html `<table>`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117755819
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -129,6 +129,90 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
         assert(result.collect() === expected.collect())
       }
     
    +  test("encodes string terms with string indexer order type") {
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +    val original = Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "aaz", 5))
    +      .toDF("id", "a", "b")
    +
    +    val expected = Seq(
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(0.0, 0.0, 4.0), 1.0),
    +        (2, "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 2.0),
    +        (3, "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 3.0),
    +        (4, "aaz", 5, Vectors.dense(0.0, 1.0, 5.0), 4.0)
    +      ).toDF("id", "a", "b", "features", "label"),
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
    +        (2, "bar", 4, Vectors.dense(0.0, 0.0, 4.0), 2.0),
    +        (3, "bar", 5, Vectors.dense(0.0, 0.0, 5.0), 3.0),
    +        (4, "aaz", 5, Vectors.dense(1.0, 0.0, 5.0), 4.0)
    +      ).toDF("id", "a", "b", "features", "label"),
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(1.0, 0.0, 4.0), 1.0),
    +        (2, "bar", 4, Vectors.dense(0.0, 1.0, 4.0), 2.0),
    +        (3, "bar", 5, Vectors.dense(0.0, 1.0, 5.0), 3.0),
    +        (4, "aaz", 5, Vectors.dense(0.0, 0.0, 5.0), 4.0)
    +      ).toDF("id", "a", "b", "features", "label"),
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(0.0, 0.0, 4.0), 1.0),
    +        (2, "bar", 4, Vectors.dense(0.0, 1.0, 4.0), 2.0),
    +        (3, "bar", 5, Vectors.dense(0.0, 1.0, 5.0), 3.0),
    +        (4, "aaz", 5, Vectors.dense(1.0, 0.0, 5.0), 4.0)
    +      ).toDF("id", "a", "b", "features", "label")
    +    )
    +
    +    var idx = 0
    +    for (orderType <- StringIndexer.supportedStringOrderType) {
    +      val model = formula.setStringIndexerOrderType(orderType).fit(original)
    +      val result = model.transform(original)
    +      val resultSchema = model.transformSchema(original.schema)
    +      assert(result.schema.toString == resultSchema.toString)
    +      assert(result.collect() === expected(idx).collect())
    +      idx += 1
    +    }
    +  }
    +
    +  test("test consistency with R when encoding string terms") {
    +    /*
    +     R code:
    +
    +     df <- data.frame(id = c(1, 2, 3, 4),
    +                  a = c("foo", "bar", "bar", "aaz"),
    +                  b = c(4, 4, 5, 5))
    +     model.matrix(id ~ a + b, df)[, -1]
    +
    +     abar afoo b
    +      0    1   4
    +      1    0   4
    +      1    0   5
    +      0    0   5
    +    */
    +    val original = Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "aaz", 5))
    +      .toDF("id", "a", "b")
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +      .setStringIndexerOrderType(StringIndexer.alphabetDesc)
    +
    +    /*
    +     Note that the category dropped after encoding is the same between R and Spark
    +     (i.e., "aaz" is treated as the reference level).
    +     However, the column order is still different:
    +     R renders the columns in ascending alphabetical order ("bar", "foo"), while
    +     RFormula renders the columns in descending alphabetical order ("foo", "bar").
    --- End diff --
    
    R and RFormula should behavior consistent if you fix the issue I mentioned above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @felixcheung  Thanks for the review. I fixed some typo. 
    Below is an example to show the difference in model estimates due to different string ordering between R and RFormula.  
    
    ```
    val df = Seq((1.0, "foo", "a"), (2.0, "bar", "b"), (3.0, "bar", "b"), (4.0, "aaz", "b"),
        (4.2, "aaz", "b"), (1.6, "bar", "a")).toDF("id", "a", "b")
    val formula = new RFormula().setFormula("id ~ a + b")
    for (orderType <- Seq("frequencyDesc", "alphabetDesc")) {
     val df2 = formula.setStringOrderType(orderType).fit(df).transform(df)
     val model = new GeneralizedLinearRegression().fit(df2)
     val estimate = (model.coefficients.toArray :+ model.intercept)
     println(orderType + ": " + estimate.sortWith(_ < _).mkString(","))
    }
    frequencyDesc: 0.5999999999999952,0.8999999999999999,1.0000000000000042,2.1999999999999957
    alphabetDesc: -2.200000000000006,-1.6000000000000025,0.899999999999996,3.200000000000005
    ```
    
    The following is the estimate from R, which is the same as `stringOrderType = "alphabetDesc"`.
    ```
    > df <- data.frame(id = c(1, 2, 3, 4, 4.2, 1.6),
    +                   a = c("foo", "bar", "bar", "aaz", "aaz", "bar"),
    +                   b = c("a", "b", "b", "b", "b", "a"))
    > sort(coef(lm(id ~ a + b, data = df)))
           afoo        abar          bb (Intercept) 
           -2.2        -1.6         0.9         3.2 
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    (FWIW, `{{{ ... }}}` should work for Javadoc too given my past try - https://github.com/apache/spark/pull/15999#discussion_r89580586)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by actuaryzhang <gi...@git.apache.org>.

Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17967

@yanboliang Thanks for the question.

The alphabetically ascending order in R is very convenient for display purpose. For example, when you do a summary of model results, the results will be easier to understand if it is in alphabetically ascending order.

That's the default, but oftentimes users will reset the reference level to make the most frequent level as the base (the one dropped in one-hot encoding). This also facilitates interpretation, because the most frequent level can be roughly regarded as the population average (in very unbalanced data). Otherwise, especially in unbalanced data, the contrast between categories with few data is most times insignificant. Of course, this does not change the model, but it is very important for interpretation.

I understand that ordering string levels by descending frequency is helpful for other applications like tree based split decisions. But it will make the ML library much better if we can support these other options that are often used in day-to-day work. This will broaden the use case of Spark ML.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula allows to drop the same categ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #76878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76878/testReport)** for PR 17967 at commit [`a1be94c`](https://github.com/apache/spark/commit/a1be94cf92649ec553da3b47fd481f5a1ac37079).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    **[Test build #77110 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77110/testReport)** for PR 17967 at commit [`5f31d31`](https://github.com/apache/spark/commit/5f31d311c0c39da1968686dd4147376b3888cee3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    cool - I think this is important to have. do you have a higher level example of the old/new model output as affected by the string ordering?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17967#discussion_r117601484
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -38,29 +38,35 @@ import org.apache.spark.sql.types._
     private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
     
       /**
    -   * Param for how to order labels of string column. The first label after ordering is assigned
    -   * an index of 0.
    -   * Options are:
    -   *   - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
    -   *   - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
    -   *   - 'alphabetDesc': descending alphabetical order
    -   *   - 'alphabetAsc': ascending alphabetical order
    -   * Default is 'frequencyDesc'.
    -   * When the ordering is set to 'alphabetDesc', `RFormula` drops the same category as R
    -   * when encoding strings.
    +   * Param for how to order categories of a FEATURE string column used by `StringIndexer`.
    +   * The last category after ordering is dropped when encoding strings.
    +   * The options are explained using an example string: 'b', 'a', 'b', 'a', 'c', 'b'
    +   * |
    +   * | Option | Category mapped to 0 by StringIndexer |  Category dropped by RFormula
    --- End diff --
    
    is this going to generate the right format, @HyukjinKwon do you know?
    I understand not all markdown style is supported


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org