Posted to issues@spark.apache.org by "Jason White (JIRA)" <ji...@apache.org> on 2016/04/18 16:20:25 UTC
[jira] [Created] (SPARK-14700) PySpark Row equality operator is not overridden
Jason White created SPARK-14700:
-----------------------------------
Summary: PySpark Row equality operator is not overridden
Key: SPARK-14700
URL: https://issues.apache.org/jira/browse/SPARK-14700
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.6.1
Reporter: Jason White
The pyspark.sql.Row class doesn't override the equality operator, so it inherits `tuple`'s, which compares only positional values. That is insufficient: the positions of the elements in the tuple are only meaningful in combination with the column names stored in the private `__fields__` member, which tuple equality ignores.
This makes it difficult to write proper unit tests against PySpark DataFrames, and leads to seemingly illogical results such as:
Row(a=1) == Row(b=1)      # True, since column names aren't considered
r1 = Row('b', 'a')(2, 1)  # Row(b=2, a=1)
r1 == Row(b=2, a=1)       # False, since kwargs are sorted alphabetically in the Row constructor, so Row(b=2, a=1) stores the tuple (1, 2)
r1 == Row(a=2, b=1)       # True, since the underlying tuple for each is (2, 1)
Indeed, patching this exposed a few bugs in existing Spark code. PR incoming.
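The behavior above can be sketched without a Spark installation. The following is a hypothetical, simplified stand-in for pyspark.sql.Row (the class name FakeRow and its constructor are illustrative, not PySpark's actual implementation): a tuple subclass that keeps field names in __fields__ but inherits tuple's __eq__, reproducing the bug, followed by a rough name-aware __eq__ of the kind the fix needs.

```python
# Hypothetical minimal model of the issue (NOT PySpark's real Row class):
# a tuple subclass whose field names live in __fields__, while equality
# is inherited from tuple and so ignores those names entirely.
class FakeRow(tuple):
    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # Row sorts keyword arguments alphabetically
        row = tuple.__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        return row

print(FakeRow(a=1) == FakeRow(b=1))   # True: (1,) == (1,), names ignored

# A name-aware __eq__, roughly what overriding the operator looks like:
class FixedRow(FakeRow):
    def __eq__(self, other):
        return (isinstance(other, FixedRow)
                and self.__fields__ == other.__fields__
                and tuple(self) == tuple(other))
    __hash__ = tuple.__hash__  # overriding __eq__ clears __hash__, restore it

print(FixedRow(a=1) == FixedRow(b=1))  # False: field names now compared
```

With the override in place, rows are equal only when both the values and the field names match, which is the comparison unit tests actually want.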
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org