You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by actuaryzhang <gi...@git.apache.org> on 2017/05/14 21:02:52 UTC

[GitHub] spark pull request #17978: [SPARK-20736] PySpark StringIndexer supports Stri...

GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/17978

    [SPARK-20736] PySpark StringIndexer supports StringOrderType

    ## What changes were proposed in this pull request?
    PySpark StringIndexer supports StringOrderType added in #17879. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark PythonStringIndexer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17978.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17978
    
----
commit ddf34a549ad76eda4627d19b190ba70daa232bc1
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-14T00:41:47Z

    Python API to StringOrderType in StringIndexer

commit c1966bba863e7c7d2ea7333f377a18f232860587
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-14T07:19:44Z

    fix typo

commit e5c8dcfcdcb9fcb9586339c3efebe85670126fb6
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-14T20:57:44Z

    fix typo

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76967/testReport)** for PR 17978 at commit [`6acabc2`](https://github.com/apache/spark/commit/6acabc2f2d27cc25fd6cb52ff25c1ba2ce69bd23).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77132/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    One minor optional comment, but not a blocker so LGTM (although if you decide to update the docstring LGTM pending tests).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #77132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77132/testReport)** for PR 17978 at commit [`5bfa4dc`](https://github.com/apache/spark/commit/5bfa4dc3ba60655d9a9ce4aded935303b90d33cb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76928 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76928/testReport)** for PR 17978 at commit [`44f0a36`](https://github.com/apache/spark/commit/44f0a362dd085022de215e9ab8d9536145f20d4d).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76917/testReport)** for PR 17978 at commit [`1f336ab`](https://github.com/apache/spark/commit/1f336ab70719f4074f4ac69cc0bb4750723b0bd5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    I'd hold this for another 3-4 days just in case..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76917/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76917/testReport)** for PR 17978 at commit [`1f336ab`](https://github.com/apache/spark/commit/1f336ab70719f4074f4ac69cc0bb4750723b0bd5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116411579
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2115,22 +2115,32 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDsec, alphabetAsc.",
    +                            typeConverter=TypeConverters.toString)
    +
         @keyword_only
    -    def __init__(self, inputCol=None, outputCol=None, handleInvalid="error"):
    +    def __init__(self, inputCol=None, outputCol=None, handleInvalid="error",
    +                 stringOrderType="frequencyDesc"):
             """
    -        __init__(self, inputCol=None, outputCol=None, handleInvalid="error")
    +        __init__(self, inputCol=None, outputCol=None, handleInvalid="error", \
    --- End diff --
    
    @HyukjinKwon  Thank you. Added tests. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76928/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @viirya Thanks much for your review. I corrected the typo and added some tests. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @felixcheung Thanks so much for the review. I addressed most of the comments except auto generating code for defining `stringOrderType`. This parameter is not a shared trait on the Scala side, and I thought only the shared traits should be automatically generated. I have looked at the code for other ML transformers, and many of them hard coded, for example, [Imputer](https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L933) and [OneHotEncoder](https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L1520).
    
    Please let me know if I'm wrong, and a reference to example would be greatly appreciated. 
    Thanks much!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @viirya @MLnick @BryanCutler @yinxusen @brkyvz @HyukjinKwon @srowen 
    Ping for reviews or comments. Thanks much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116655296
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2111,26 +2112,45 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         >>> loadedInverter = IndexToString.load(indexToStringPath)
         >>> loadedInverter.getLabels() == inverter.getLabels()
         True
    +    >>> stringIndexer.getStringOrderType()
    +    'frequencyDesc'
    +    >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='error',
    +    ...     stringOrderType="alphabetDesc")
    +    >>> model = stringIndexer.fit(stringIndDf)
    +    >>> td = model.transform(stringIndDf)
    +    >>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
    +    ...     key=lambda x: x[0])
    +    [(0, 2.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 0.0)]
     
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
    --- End diff --
    
    hmm, ok, I see a few examples that they are not generated


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @felixcheung Would you help merge this? Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Code changes looks good. But we need to add test for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #77132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77132/testReport)** for PR 17978 at commit [`5bfa4dc`](https://github.com/apache/spark/commit/5bfa4dc3ba60655d9a9ce4aded935303b90d33cb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116654554
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2082,8 +2082,9 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         """
         A label indexer that maps a string column of labels to an ML column of label indices.
         If the input column is numeric, we cast it to string and index the string values.
    -    The indices are in [0, numLabels), ordered by label frequencies.
    -    So the most frequent label gets index 0.
    +    The indices are in [0, numLabels). By default, this is ordered by label frequencies
    +    so the most frequent label gets index 0. The ordering behavior is controlled by
    +    setting stringOrderType.
    --- End diff --
    
    Added tag. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    (I am not used to ML. I just left a trivial comment for Python.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    LGTM, ping @holdenk @jkbradley if they are interested 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76916/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76916/testReport)** for PR 17978 at commit [`bd80b37`](https://github.com/apache/spark/commit/bd80b37d9728624c6455ceca12198ce763b32a91).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77131/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76929/testReport)** for PR 17978 at commit [`f66a445`](https://github.com/apache/spark/commit/f66a4455aba7ffc69d1b397cb828879d84bb39a6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76967/testReport)** for PR 17978 at commit [`6acabc2`](https://github.com/apache/spark/commit/6acabc2f2d27cc25fd6cb52ff25c1ba2ce69bd23).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    LGTM @felixcheung 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116655031
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2111,26 +2112,45 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         >>> loadedInverter = IndexToString.load(indexToStringPath)
         >>> loadedInverter.getLabels() == inverter.getLabels()
         True
    +    >>> stringIndexer.getStringOrderType()
    +    'frequencyDesc'
    +    >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='error',
    +    ...     stringOrderType="alphabetDesc")
    +    >>> model = stringIndexer.fit(stringIndDf)
    +    >>> td = model.transform(stringIndDf)
    +    >>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
    +    ...     key=lambda x: x[0])
    +    [(0, 2.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 0.0)]
     
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
    --- End diff --
    
    @felixcheung `stringOrderType` is not a shared trait on the Scala side, and I thought only the shared traits should be automatically generated. 
    I have looked at the code for other ML transformers, and many of them hard coded, for example, [Imputer](https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L933) and [OneHotEncoder](https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L1520). 
    Please let me know if I'm wrong, and a reference to example would be greatly appreciated. Thanks much!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #77131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77131/testReport)** for PR 17978 at commit [`2fe9432`](https://github.com/apache/spark/commit/2fe9432945f16b77916244b0cc36ff07cdb53693).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116395911
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2115,22 +2115,32 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDsec, alphabetAsc.",
    +                            typeConverter=TypeConverters.toString)
    +
         @keyword_only
    -    def __init__(self, inputCol=None, outputCol=None, handleInvalid="error"):
    +    def __init__(self, inputCol=None, outputCol=None, handleInvalid="error",
    +                 stringOrderType="frequencyDesc"):
             """
    -        __init__(self, inputCol=None, outputCol=None, handleInvalid="error")
    +        __init__(self, inputCol=None, outputCol=None, handleInvalid="error", \
    --- End diff --
    
    (Probably, the leading `\` could be removed.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76929/testReport)** for PR 17978 at commit [`f66a445`](https://github.com/apache/spark/commit/f66a4455aba7ffc69d1b397cb828879d84bb39a6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76914/testReport)** for PR 17978 at commit [`e5c8dcf`](https://github.com/apache/spark/commit/e5c8dcfcdcb9fcb9586339c3efebe85670126fb6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @felixcheung Could you take a look? Thanks.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76929/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    merged to master. thanks everyone!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17978


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76914/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76916/testReport)** for PR 17978 at commit [`bd80b37`](https://github.com/apache/spark/commit/bd80b37d9728624c6455ceca12198ce763b32a91).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76928/testReport)** for PR 17978 at commit [`44f0a36`](https://github.com/apache/spark/commit/44f0a362dd085022de215e9ab8d9536145f20d4d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @viirya @felixcheung Any additional changes needed for this one? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116652194
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2139,6 +2159,20 @@ def setParams(self, inputCol=None, outputCol=None, handleInvalid="error"):
         def _create_model(self, java_model):
             return StringIndexerModel(java_model)
     
    +    @since("2.3.0")
    +    def setStringOrderType(self, value):
    +        """
    +        Sets the value of :py:attr:`stringOrderType`.
    +        """
    +        return self._set(stringOrderType=value)
    +
    +    @since("2.3.0")
    +    def getStringOrderType(self):
    +        """
    +        Gets the value of :py:attr:`stringOrderType` or its default value.
    --- End diff --
    
    should it say what the default value is?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116654542
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2139,6 +2159,20 @@ def setParams(self, inputCol=None, outputCol=None, handleInvalid="error"):
         def _create_model(self, java_model):
             return StringIndexerModel(java_model)
     
    +    @since("2.3.0")
    +    def setStringOrderType(self, value):
    +        """
    +        Sets the value of :py:attr:`stringOrderType`.
    +        """
    +        return self._set(stringOrderType=value)
    +
    +    @since("2.3.0")
    +    def getStringOrderType(self):
    +        """
    +        Gets the value of :py:attr:`stringOrderType` or its default value.
    --- End diff --
    
    OK, added default value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #77131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77131/testReport)** for PR 17978 at commit [`2fe9432`](https://github.com/apache/spark/commit/2fe9432945f16b77916244b0cc36ff07cdb53693).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736] PySpark StringIndexer supports StringOrder...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @felixcheung 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r117612782
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2111,26 +2112,45 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         >>> loadedInverter = IndexToString.load(indexToStringPath)
         >>> loadedInverter.getLabels() == inverter.getLabels()
         True
    +    >>> stringIndexer.getStringOrderType()
    +    'frequencyDesc'
    +    >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='error',
    +    ...     stringOrderType="alphabetDesc")
    +    >>> model = stringIndexer.fit(stringIndDf)
    +    >>> td = model.transform(stringIndDf)
    +    >>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
    +    ...     key=lambda x: x[0])
    +    [(0, 2.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 0.0)]
     
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
    --- End diff --
    
    I know were mixed on doing this, but I like including the default value in the docstring, makes the documentation closer to the Scala doc and makes it easier to read without having to refer to the ScalaDoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    **[Test build #76914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76914/testReport)** for PR 17978 at commit [`e5c8dcf`](https://github.com/apache/spark/commit/e5c8dcfcdcb9fcb9586339c3efebe85670126fb6).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116652156
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2082,8 +2082,9 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         """
         A label indexer that maps a string column of labels to an ML column of label indices.
         If the input column is numeric, we cast it to string and index the string values.
    -    The indices are in [0, numLabels), ordered by label frequencies.
    -    So the most frequent label gets index 0.
    +    The indices are in [0, numLabels). By default, this is ordered by label frequencies
    +    so the most frequent label gets index 0. The ordering behavior is controlled by
    +    setting stringOrderType.
    --- End diff --
    
    I think you need to backtick and add a tag for the attribute


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116652102
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2111,26 +2112,45 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         >>> loadedInverter = IndexToString.load(indexToStringPath)
         >>> loadedInverter.getLabels() == inverter.getLabels()
         True
    +    >>> stringIndexer.getStringOrderType()
    +    'frequencyDesc'
    +    >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='error',
    +    ...     stringOrderType="alphabetDesc")
    +    >>> model = stringIndexer.fit(stringIndDf)
    +    >>> td = model.transform(stringIndDf)
    +    >>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
    +    ...     key=lambda x: x[0])
    +    [(0, 2.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 0.0)]
     
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
    --- End diff --
    
    I think this should be generated instead of hardcoded - you can find a example on python..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by actuaryzhang <gi...@git.apache.org>.
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    @holdenk  Thanks for the comment. Added default value in docstring. 
    @felixcheung Please let me know if there is anything else needed for this PR. 
    Thanks everyone for the review and comments! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116400199
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2115,22 +2115,32 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDsec, alphabetAsc.",
    --- End diff --
    
    alphabetDsec -> alphabetDesc


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17978#discussion_r116395876
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2115,22 +2115,32 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
         .. versionadded:: 1.4.0
         """
     
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDsec, alphabetAsc.",
    +                            typeConverter=TypeConverters.toString)
    +
         @keyword_only
    -    def __init__(self, inputCol=None, outputCol=None, handleInvalid="error"):
    +    def __init__(self, inputCol=None, outputCol=None, handleInvalid="error",
    +                 stringOrderType="frequencyDesc"):
             """
    -        __init__(self, inputCol=None, outputCol=None, handleInvalid="error")
    +        __init__(self, inputCol=None, outputCol=None, handleInvalid="error", \
    --- End diff --
    
    Probably, the leading `\` could be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17978
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76967/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org