Posted to reviews@spark.apache.org by JasonMWhite <gi...@git.apache.org> on 2016/04/18 16:31:50 UTC

[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

GitHub user JasonMWhite opened a pull request:

    https://github.com/apache/spark/pull/12470

    [SPARK-14700] [Python] adding SQL Row equality and inequality overrides

    ## What changes were proposed in this pull request?
    
This PR adds equality and inequality overrides to `pyspark.sql.Row`. Without this override, `Row` simply inherits the equality operator of `tuple`, which compares values positionally and ignores the private `__fields__` member of `Row`, so field names play no part in the comparison.
    
    Without this PR, we get seemingly illogical equality checks like:
    ```
    Row(a=1) == Row(b=1) # True
    r1 = Row('b', 'a')(2, 1) # Row(b=2, a=1)
    r1 == Row(b=2, a=1) # False
    r1 == Row(a=2, b=1) # True
    ```
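The behaviour this PR proposes can be sketched in plain Python. This is illustrative only; `fields_equal` is a hypothetical helper, not part of the patch, and it assumes the order-insensitive, name-based semantics argued for above:

```python
def fields_equal(fields1, values1, fields2, values2):
    # Pair each value with its field name, then compare the resulting
    # mappings -- so field names, not positions, decide equality.
    return dict(zip(fields1, values1)) == dict(zip(fields2, values2))

# Mirrors the examples above with the proposed semantics:
print(fields_equal(('b', 'a'), (2, 1), ('a', 'b'), (1, 2)))  # True
print(fields_equal(('a',), (1,), ('b',), (1,)))              # False
```

Under these semantics `Row(b=2, a=1)` would compare equal to `Row('b', 'a')(2, 1)`, regardless of how the constructor orders the fields internally.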
    
    ## How was this patch tested?
    
    Unit tests were added to `pyspark/sql/types.py` in the `Row` docstring. A handful of newly failing tests in `pyspark/sql/tests.py` were fixed.
    
    This PR is my original work and I license the work to the project under the project's open source license.
    
    @davies @rxin Mind having a look?


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JasonMWhite/spark row_equality

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12470.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12470
    
----
commit 986c8f27f34c53d44df59296042605d373bae7e7
Author: Jason White <ja...@shopify.com>
Date:   2016-04-18T05:10:51Z

    adding SQL Row equality and inequality overrides

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by JasonMWhite <gi...@git.apache.org>.
Github user JasonMWhite commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12470#discussion_r60122211
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1448,6 +1448,54 @@ def __repr__(self):
             else:
                 return "<Row(%s)>" % ", ".join(self)
     
    +    def __eq__(self, other):
    +        """
    +        Test for equality with `other`.
    +
    +        :param other: other Row for comparison
    +
    +        >>> Row(name="Alice", age=11) == Row(age=11, name="Alice")
    +        True
    +        >>> R1 = Row('a', 'b')
    +        >>> R2 = Row('b', 'a')
    +        >>> R1(1, 2) == R2(2, 1)
    --- End diff --
    
    I don't think I agree with this approach. It comes down to whether or not the ordering of the columns in the schema should matter, and I don't believe it should. If the ordering of the fields matters, then the Row constructor probably shouldn't automatically sort the provided kwargs alphabetically. This leads to very weird and nonintuitive results like:
    ```
    r1 = df.select('b', 'a').collect()[0]
    r1  # Row(b=2, a=1)
    r1 == Row(b=2, a=1) # False
    ```
    The only way to test equality properly would then be:
    ```
    r1 == Row('b', 'a')(2, 1) # True
    ```
    What benefit do we gain by making column ordering a factor in Row equality?




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12470#discussion_r60103407
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1448,6 +1448,54 @@ def __repr__(self):
             else:
                 return "<Row(%s)>" % ", ".join(self)
     
    +    def __eq__(self, other):
    +        """
    +        Test for equality with `other`.
    +
    +        :param other: other Row for comparison
    +
    +        >>> Row(name="Alice", age=11) == Row(age=11, name="Alice")
    +        True
    +        >>> R1 = Row('a', 'b')
    +        >>> R2 = Row('b', 'a')
    +        >>> R1(1, 2) == R2(2, 1)
    --- End diff --
    
    Row is not a dictionary; I'd like to return False in this case, because the schemas are different.
    
    Right now, Row is closer to a namedtuple.
    
    ```python
    >>> namedtuple("A", "a b")(1, 2) == namedtuple("B", "b a")(2, 1)
    False
    ```
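A self-contained version of that check, using only the standard library:

```python
from collections import namedtuple

A = namedtuple("A", "a b")
B = namedtuple("B", "b a")

# namedtuple equality is purely positional: field names are ignored,
# so this compares the tuple (1, 2) against (2, 1).
print(A(1, 2) == B(2, 1))  # False
print(A(1, 2) == (1, 2))   # True: a namedtuple also equals a plain tuple
```

This is the current `Row` behaviour the PR is arguing against: equality is inherited from `tuple` and field names never enter into it.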




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by JasonMWhite <gi...@git.apache.org>.
Github user JasonMWhite closed the pull request at:

    https://github.com/apache/spark/pull/12470




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12470#issuecomment-211589128
  
    @JasonMWhite you said in the jira ticket that there were a few bugs exposed. Can you submit those too?
    
    I talked more with @davies offline. I think the main problem here is that column names can have duplicates, and in those cases it is difficult to define what the correct behavior is if the equality check is column-order-agnostic.




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by JasonMWhite <gi...@git.apache.org>.
Github user JasonMWhite commented on the pull request:

    https://github.com/apache/spark/pull/12470#issuecomment-211712175
  
    After some experimentation on the Scala side, I see that this PR would be a significant departure from Scala's interface, where the schema and field names are not factors in equality checks at all. Column ordering is all that matters. That seems odd to me, but perhaps that changes in 2.0.
    
    There are a couple of other departures from Scala-style Rows:
    - Accessing an ambiguous column doesn't throw an error
    - You can't check row equality like `Row(1, 2) == Row(b=1, a=2)`
    
    This second issue makes unit tests a little counterintuitive. The use of kwargs at all, and making `Row` look dict-ish, might be an issue. I'll create separate issues for these. Closing PR.




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by JasonMWhite <gi...@git.apache.org>.
Github user JasonMWhite commented on the pull request:

    https://github.com/apache/spark/pull/12470#issuecomment-211600203
  
    The "few bugs" are included in this PR; all are column naming issues. They're in `python/pyspark/sql/tests.py`.
    
    Could you provide an example of what could cause these duplicate column names? Is it something like `df.select('a', 'a')`, post-join behaviour, or something else?




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by JasonMWhite <gi...@git.apache.org>.
Github user JasonMWhite commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12470#discussion_r60134541
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1448,6 +1448,54 @@ def __repr__(self):
             else:
                 return "<Row(%s)>" % ", ".join(self)
     
    +    def __eq__(self, other):
    +        """
    +        Test for equality with `other`.
    +
    +        :param other: other Row for comparison
    +
    +        >>> Row(name="Alice", age=11) == Row(age=11, name="Alice")
    +        True
    +        >>> R1 = Row('a', 'b')
    +        >>> R2 = Row('b', 'a')
    +        >>> R1(1, 2) == R2(2, 1)
    --- End diff --
    
    Although, carrying that to its logical conclusion, `R1 == R2` should return `True` in this case, which it currently doesn't.




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/12470#issuecomment-211636646
  
    Can they do it by position?





[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by JasonMWhite <gi...@git.apache.org>.
Github user JasonMWhite commented on the pull request:

    https://github.com/apache/spark/pull/12470#issuecomment-211635125
  
    Hmm, I see what you mean.
    ```
    r1 = df.select('a', 'b', df['b'].alias('a')).collect()[0]
    r1 # Row(a=1, b=2, a=2)
    ```
    What does that even mean? And what should `r1['a']` return? Throw an exception for the ambiguity, perhaps? If so, how would the user disambiguate the columns again?
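Why duplicate names break an order-agnostic comparison can be shown without Spark (the field names and values below are made up to match the example above):

```python
# With duplicate field names, mapping name -> value (as an
# order-agnostic comparison would have to) silently collapses columns.
fields = ('a', 'b', 'a')   # e.g. from select('a', 'b', df['b'].alias('a'))
values = (1, 2, 2)

as_dict = dict(zip(fields, values))
print(as_dict)                      # {'a': 2, 'b': 2}: the first 'a' is lost
print(len(as_dict) == len(values))  # False
```

Once names alone no longer identify values, there is no obvious answer to what a name-based equality check, or `r1['a']`, should do.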




[GitHub] spark pull request: [SPARK-14700] [Python] adding SQL Row equality...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12470#issuecomment-211487266
  
    Can one of the admins verify this patch?

