Posted to issues@spark.apache.org by "Jason White (JIRA)" <ji...@apache.org> on 2016/04/18 16:20:25 UTC

[jira] [Created] (SPARK-14700) PySpark Row equality operator is not overridden

Jason White created SPARK-14700:
-----------------------------------

             Summary: PySpark Row equality operator is not overridden
                 Key: SPARK-14700
                 URL: https://issues.apache.org/jira/browse/SPARK-14700
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.1
            Reporter: Jason White


The pyspark.sql.Row class doesn't override the equality operator, so comparisons fall back to the superclass, `tuple`, and compare only the positional values. This is insufficient, as the order of the elements in the tuple is only meaningful in combination with the private `__fields__` member, which maps each position to a column name.

This makes it difficult to write proper unit tests for PySpark DataFrames and leads to seemingly illogical results such as:

from pyspark.sql import Row

Row(a=1) == Row(b=1)      # True, since column names aren't considered
r1 = Row('b', 'a')(2, 1)  # Row(b=2, a=1)
r1 == Row(b=2, a=1)       # False, since keyword arguments are sorted alphabetically in the Row constructor
r1 == Row(a=2, b=1)       # True, since the underlying tuple for each is (2, 1)
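
In the meantime, tests can work around this by comparing the rows' asDict() output instead of the rows themselves. A small sketch (the helper name rows_match is just for illustration, and it assumes both rows were built with named columns):

from pyspark.sql import Row

def rows_match(r1, r2):
    # Compare column-name -> value mappings, so field order doesn't matter.
    return r1.asDict() == r2.asDict()

rows_match(Row(a=1), Row(b=1))                  # False, as one would expect
rows_match(Row('b', 'a')(2, 1), Row(b=2, a=1))  # True, as one would expect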

Indeed, a few bugs in existing Spark code were exposed when I patched this. PR incoming.
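
To illustrate the intended semantics, here is a rough, self-contained sketch of a field-aware equality check (FieldAwareRow is a hypothetical stand-in for pyspark.sql.Row; this is only an assumption about what the fix could look like, not the actual patch):

class FieldAwareRow(tuple):
    """Simplified stand-in for pyspark.sql.Row with field-aware equality."""

    def __new__(cls, fields, values):
        row = tuple.__new__(cls, values)
        row.__fields__ = list(fields)
        return row

    def __eq__(self, other):
        # Rows are equal only when their column-name -> value mappings match,
        # regardless of the positional order of the fields.
        if not isinstance(other, FieldAwareRow):
            return NotImplemented
        return dict(zip(self.__fields__, self)) == dict(zip(other.__fields__, other))

    def __ne__(self, other):
        eq = self.__eq__(other)
        return eq if eq is NotImplemented else not eq

    def __hash__(self):
        return hash(frozenset(zip(self.__fields__, self)))

FieldAwareRow(('b', 'a'), (2, 1)) == FieldAwareRow(('a', 'b'), (1, 2))  # True
FieldAwareRow(('a',), (1,)) == FieldAwareRow(('b',), (1,))              # False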


