You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by zero323 <gi...@git.apache.org> on 2017/02/03 19:48:04 UTC

[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

GitHub user zero323 opened a pull request:

    https://github.com/apache/spark/pull/16792

    [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataFrame.replace docstring

    ## What changes were proposed in this pull request?
    
    - Provides correct description of the semantics of a `dict` argument passed as `to_replace`.
    - Describes type requirements for collection arguments.
    - Describes behavior with `to_replace: List[T]` and `value: T`
    
    ## How was this patch tested?
    
    Manual testing, documentation build.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-19453

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16792.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16792
    
----
commit 132b165d312b263827c9a47527ce9bf00ac68c68
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-02-01T23:51:05Z

    Clarify DataFrame.replace behavior with dict as to_replace

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99464297
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    Would this maybe be clearer with "The types of ..." rather than "Values ..."?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    thanks for putting in the time to come up with a clear warning about how the types are handled.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99470045
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    +        replacements are not supported.
     
    -        :param to_replace: int, long, float, string, or list.
    +        :param to_replace: int, long, float, string, list or dict.
                 Value to be replaced.
                 If the value is a dict, then `value` is ignored and `to_replace` must be a
    -            mapping from column name (string) to replacement value. The value to be
    -            replaced must be an int, long, float, or string.
    +            mapping between a value and a replacement.
             :param value: int, long, float, string, or list.
    -            Value to use to replace holes.
                 The replacement value must be an int, long, float, or string. If `value` is a
    -            list or tuple, `value` should be of the same length with `to_replace`.
    +            list, `value` should be of the same length and type as `to_replace`.
    +            If `value` is a scalar and `to_replace` is a sequence, then `value` is replicated
    --- End diff --
    
    This is true, but perhaps it would be clearer to say if value is a scalar and to_replace is a sequence then value is used as the replacement for every instance in to_replace. (Since that is the effect the user sees - they don't really need to know that we are copying it internally in a list).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99473000
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    Oh, you mean when we use higher precision type with lower precision column. Not really an issue with `replace`, don't you think? Maybe something like _`value` will be truncated when used with lower precision column_? I doubt we can make it precise and comprehensible at the same time and the main goal should be to signal possible issues while keeping things relatively clear. After all there are four different signatures to cover here.
    
    I've missed boolean case. I keep forgetting that `True` is instance of  `int`.  Added to the other PR, with tests for mixed numeric types.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72317/testReport)** for PR 16792 at commit [`132b165`](https://github.com/apache/spark/commit/132b165d312b263827c9a47527ce9bf00ac68c68).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72317/testReport)** for PR 16792 at commit [`132b165`](https://github.com/apache/spark/commit/132b165d312b263827c9a47527ce9bf00ac68c68).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72391/testReport)** for PR 16792 at commit [`9f73508`](https://github.com/apache/spark/commit/9f73508016d276902937423492d8864b664c092b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99486909
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    Challenge accepted :) 
    
    This makes me think we should also document uniqueness requirements. User might expect that:
    
    ```
    df.replace([1, 1.0], [2, 3.0]) 
    ```
    
    or 
    
    ```
    df.replace({1: 2, 1.0: 3.0})
    ```
    when in fact there will be only one pair considered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72441/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99470996
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    I've seen some variation between bundled Py4j (`py4j-0.10.4-src.zip`) and pip installed version (should be the same `py4j==0.10.4`) which suggests there is something more going on here but I would really avoid restricting Python users even more. 
    
    If necessary we can always proactively cast all `ints` to `floats`. It will be an ugly hack but won't affect user code or correctness.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Thanks for putting in the time to get this really well documented :) Merged to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72441/testReport)** for PR 16792 at commit [`2bd4417`](https://github.com/apache/spark/commit/2bd441728e4263b4efc8be1fda87d502fb0daf8b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99470847
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    +        replacements are not supported.
     
    -        :param to_replace: int, long, float, string, or list.
    +        :param to_replace: int, long, float, string, list or dict.
                 Value to be replaced.
                 If the value is a dict, then `value` is ignored and `to_replace` must be a
    -            mapping from column name (string) to replacement value. The value to be
    -            replaced must be an int, long, float, or string.
    +            mapping between a value and a replacement.
             :param value: int, long, float, string, or list.
    -            Value to use to replace holes.
                 The replacement value must be an int, long, float, or string. If `value` is a
    -            list or tuple, `value` should be of the same length with `to_replace`.
    +            list, `value` should be of the same length and type as `to_replace`.
    +            If `value` is a scalar and `to_replace` is a sequence, then `value` is replicated
    --- End diff --
    
    Much better :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72439/testReport)** for PR 16792 at commit [`78f04dc`](https://github.com/apache/spark/commit/78f04dc8a05c1b0373335f26464411e4a681729a).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Thanks @rxin.
    
    This is related to  #16793 / [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72829 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72829/testReport)** for PR 16792 at commit [`d405e91`](https://github.com/apache/spark/commit/d405e919661258524b3d68a4f21f0a658f427452).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72439/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Can you update this to the latest master? (The merge tool has some warnings).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99470824
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    Do you have a reproducible example of this behavior? Scala seems to accept mixed numerics (though it is not documented):
    ```scala
    val df = Seq(("Alice", 1, 4.1)).toDF
    df.na.replace(df.columns, Map[Any, Any](1 -> 2, 4.1 -> 5.1)).
    ```
    and Python equivalent seems to work for me as well. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99478139
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    We can leave the warning about truncation off if you think its unnecessary - but I do like the description you came up with for it.
    
    I think the current proposed text on the expected types is a little vague (as you said) and I think we can have something that is precise, accurate, and easy to read so I'd like us to give it a shot :)
    
    How about something like: "The element(s) `to_replace` and `value` should be the same type(s) (either all numerics, all booleans, or all strings)." This way we've clarified that mixing the different numerics is ok since it all gets converted to doubles at the end of the day? I'm open to other suggestions though too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72829/testReport)** for PR 16792 at commit [`d405e91`](https://github.com/apache/spark/commit/d405e919661258524b3d68a4f21f0a658f427452).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16792


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99471688
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    I don't think we need to cast the types - if you look inside of `replace0` all of the numerics are turned into doubles in the map (but we should probably - in your other PR - add a test around that so that if the internals change we know we need to update the Python side).
    
    Doing `sc.parallelize([Row(name='Alice', age=0, height=80)]).toDF().replace(0, 12.5).collect()` is what I was talking about cutting of the deciminal component (so while it runs it arguable doesn't do what the user expects - but that ).
    
    What about something along the lines of: `to_replace` and `value` should contain either all numerics, all booleans, or all strings. When replacing, the new value will be cast to the type of the existing column."
    
    I think this more clearly communicates the requirements, but is still a bit awkward -- can you think of something better?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72391/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    cc @rxin, @holdenk 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99572943
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    After a quick survey it looks like both behaviors (truncate and fp uniqueness) can be confusing so let's add both.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72829/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99470201
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    It's true that it will succeed when passed a float for an int, but it chops the float into an int which I suspect would be confusing if anyone passed in something with a non-zero decimal component. If you look at the documentation in Scala side it says "Key and value of `replacement` map must have the same type" 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and exten...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16792#discussion_r99465189
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1272,16 +1272,18 @@ def replace(self, to_replace, value, subset=None):
             """Returns a new :class:`DataFrame` replacing a value with another value.
             :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
             aliases of each other.
    +        Values `to_replace` and `value` should be homogeneous. Mixed string and numeric
    --- End diff --
    
    @holdenk I am not convinced. I am intentionally trying to be a bit vague to avoid suggestion that we support only `int` -> `int`or `float` -> `float`. We cannot use abstract base class (`numbers.Real`?) because we depend on types, not interfaces.
    
    But if you feel strongly about it I'll trust your judgment. I would probably write `TypeVar('T', str, Union[float, int])` but I am pretty sure it wouldn't be welcome :) and I hope the other PR will make the requirements explicit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72317/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72391/testReport)** for PR 16792 at commit [`9f73508`](https://github.com/apache/spark/commit/9f73508016d276902937423492d8864b664c092b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72439/testReport)** for PR 16792 at commit [`78f04dc`](https://github.com/apache/spark/commit/78f04dc8a05c1b0373335f26464411e4a681729a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    Thanks @holdenk 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    **[Test build #72441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72441/testReport)** for PR 16792 at commit [`2bd4417`](https://github.com/apache/spark/commit/2bd441728e4263b4efc8be1fda87d502fb0daf8b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16792: [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataF...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16792
  
    That sounds reasonable then :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org