Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/05/02 03:52:29 UTC

[GitHub] spark pull request #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/17827

    [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDistinctFrom for column APIs in Scala and Python

    ## What changes were proposed in this pull request?
    
    This PR proposes to add both `isNotDistinctFrom` and `isDistinctFrom` to both Scala and Python column APIs.
    
    `IS [NOT] DISTINCT FROM` syntax is now supported following https://github.com/apache/spark/pull/17764.
    
    Adding a Python API was initially suggested in that PR, but that PR turned into a SQL-syntax-only change. Per https://github.com/apache/spark/pull/17764#discussion_r114048387, I assume we want this.
    
    ## How was this patch tested?
    
    Doctests for Python and unit tests in `ColumnExpressionSuite`.
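
The null-safe comparison semantics this pull request exposes can be modelled in plain Python, without Spark. The sketch below is only an illustration: the function names are hypothetical, `None` stands in for SQL NULL, and NaN handling is ignored. It reproduces the two rows of the `isDistinctFrom` doctest table quoted in this thread.

```python
from typing import Any, Optional


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """Model of SQL `=`: the result is NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_ne(a: Any, b: Any) -> Optional[bool]:
    """Model of SQL `!=`, i.e. NOT (a = b), with NULL propagating."""
    eq = sql_eq(a, b)
    return None if eq is None else not eq


def eq_null_safe(a: Any, b: Any) -> bool:
    """Model of `<=>` / eqNullSafe: always returns True or False, never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def is_not_distinct_from(a: Any, b: Any) -> bool:
    """IS NOT DISTINCT FROM is the same predicate as `<=>`."""
    return eq_null_safe(a, b)


def is_distinct_from(a: Any, b: Any) -> bool:
    """IS DISTINCT FROM is the negation of `<=>`."""
    return not eq_null_safe(a, b)


# Same data as the doctest: one row with value 'foo', one with a NULL value.
values = ["foo", None]
rows = [
    (sql_ne(v, "foo"), is_distinct_from(v, "foo"), is_distinct_from(v, None))
    for v in values
]
```

In this model `is_not_distinct_from` is just an alias of `eq_null_safe`, which is the redundancy the reviewers raise later in the thread.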


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-20552

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17827.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17827
    
----
commit 008cec4ed1a45c299bec4a6ce6114e5871c0398c
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-05-02T03:28:04Z

    Add isNotDistinctFrom/isDistinctFrom for column APIs in Scala/Java and Python

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    **[Test build #76370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76370/testReport)** for PR 17827 at commit [`008cec4`](https://github.com/apache/spark/commit/008cec4ed1a45c299bec4a6ce6114e5871c0398c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    **[Test build #76374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76374/testReport)** for PR 17827 at commit [`6d658d4`](https://github.com/apache/spark/commit/6d658d48cce745647422b38b864fa708a084de59).




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    `IS [NOT] DISTINCT FROM` is part of ANSI SQL, and thus we decided to support it. I am not sure whether we need to add them to the Java and Python column APIs when we already have `eqNullSafe`.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76376/
    Test PASSed.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    **[Test build #76376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76376/testReport)** for PR 17827 at commit [`6d658d4`](https://github.com/apache/spark/commit/6d658d48cce745647422b38b864fa708a084de59).




[GitHub] spark pull request #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17827#discussion_r114244470
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
    @@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext {
     
       test("<=>") {
         checkAnswer(
    -      testData2.filter($"a" === 1),
    -      testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
    -
    -    checkAnswer(
    -      testData2.filter($"a" === $"b"),
    -      testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
    --- End diff --
    
    The test below:
    
    ```scala
        checkAnswer(
          testData2.filter($"a" === 1),
          testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
    
        checkAnswer(
          testData2.filter($"a" === $"b"),
          testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
    ```
    
    looked identical to the test for `===` above and does not test `<=>`, so I removed it as a duplicate.




[GitHub] spark pull request #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17827#discussion_r114244380
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -224,7 +224,39 @@ def __init__(self, jc):
            https://spark.apache.org/docs/latest/sql-programming-guide.html#nan-semantics
         .. versionadded:: 2.3.0
         """
    +    _isNotDistinctFrom_doc = _eqNullSafe_doc.replace("eqNullSafe", "isNotDistinctFrom")
    +    _isDistinctFrom_doc = """
    +    Inequality test that is safe for null values.
    +
    +    :param other: a value or :class:`Column`
    +
    +    >>> from pyspark.sql import Row
    +    >>> df1 = spark.createDataFrame([
    +    ...     Row(id=1, value='foo'),
    +    ...     Row(id=2, value=None)
    +    ... ])
    +    >>> df1.select(
    +    ...     df1['value'] != 'foo',
    +    ...     df1['value'].isDistinctFrom('foo'),
    +    ...     df1['value'].isDistinctFrom(None)
    +    ... ).show()
    +    +-------------------+---------------------+----------------------+
    +    |(NOT (value = foo))|(NOT (value <=> foo))|(NOT (value <=> NULL))|
    +    +-------------------+---------------------+----------------------+
    +    |              false|                false|                  true|
    +    |               null|                 true|                 false|
    +    +-------------------+---------------------+----------------------+
    +
    +    .. note:: Unlike Pandas, PySpark doesn't consider NaN values to be NULL.
    +       See the `NaN Semantics`_ for details.
    +    .. _NaN Semantics:
    +       https://spark.apache.org/docs/latest/sql-programming-guide.html#nan-semantics
    +    .. versionadded:: 2.3.0
    +    """
    --- End diff --
    
    ![2017-05-02 12 41 02](https://cloud.githubusercontent.com/assets/6477701/25603406/5de1c57a-2f36-11e7-8f62-fd20494a4434.png)





[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    retest this please




[GitHub] spark pull request #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17827#discussion_r114260664
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
    @@ -475,6 +475,22 @@ class Column(val expr: Expression) extends Logging {
       def eqNullSafe(other: Any): Column = this <=> other
     
       /**
    +   * Equality test that is safe for null values.
    +   *
    +   * @group java_expr_ops
    --- End diff --
    
    Like `eqNullSafe`, these are normally used from the Java API.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    **[Test build #76376 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76376/testReport)** for PR 17827 at commit [`6d658d4`](https://github.com/apache/spark/commit/6d658d48cce745647422b38b864fa708a084de59).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17827#discussion_r114244526
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
    @@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext {
     
       test("<=>") {
         checkAnswer(
    -      testData2.filter($"a" === 1),
    -      testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
    -
    -    checkAnswer(
    -      testData2.filter($"a" === $"b"),
    -      testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
    -  }
    -
    -  test("=!=") {
    --- End diff --
    
    The `=!=` test looked like it was actually testing `<=>`. I renamed it to `<=>` and added a separate test for `=!=` below.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76370/
    Test FAILed.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17827#discussion_r114244337
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -224,7 +224,39 @@ def __init__(self, jc):
            https://spark.apache.org/docs/latest/sql-programming-guide.html#nan-semantics
         .. versionadded:: 2.3.0
         """
    +    _isNotDistinctFrom_doc = _eqNullSafe_doc.replace("eqNullSafe", "isNotDistinctFrom")
    --- End diff --
    
    ![2017-05-02 12 41 27](https://cloud.githubusercontent.com/assets/6477701/25603393/3b0db43c-2f36-11e7-932c-135b7e7d4ca3.png)
    
    This is the same as the `eqNullSafe` documentation; only the word `eqNullSafe` was replaced with `isNotDistinctFrom`.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76374/
    Test FAILed.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    **[Test build #76370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76370/testReport)** for PR 17827 at commit [`008cec4`](https://github.com/apache/spark/commit/008cec4ed1a45c299bec4a6ce6114e5871c0398c).




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    cc @gatorsmile and @ptkool, could you take a look and see if it makes sense please?




[GitHub] spark pull request #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/17827




[GitHub] spark issue #17827: [SPARK-20552][SQL][PYTHON] Add isNotDistinctFrom/isDisti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17827
  
    Yea, that is what I initially thought. I am closing this.

