Posted to issues@spark.apache.org by "Hugo (JIRA)" <ji...@apache.org> on 2019/06/26 13:10:00 UTC
[jira] [Created] (SPARK-28172) pyspark DataFrame equality operator
Hugo created SPARK-28172:
----------------------------
Summary: pyspark DataFrame equality operator
Key: SPARK-28172
URL: https://issues.apache.org/jira/browse/SPARK-28172
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 2.4.3
Reporter: Hugo
*Motivation:*
Make it easier to test equality between DataFrames. Many approaches to applying TDD practices in data science require checking a function's output against an expected output (similar to API snapshot testing). An equality operator, or something like it, would make this straightforward.
A basic example:
{code}
from pyspark.ml.feature import Imputer

def create_mock_missing_df():
    return spark.createDataFrame(
        [("a", 1.0), ("b", 2.0), ("c", float('nan'))], ['COL1', 'COL2'])

def test_imputation():
    """Test mean value imputation"""
    df1 = create_mock_missing_df()

    # load snapshot
    pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect()
    df2 = spark.createDataFrame(pickled_snapshot)
    # >>> df2.show()
    # +----+------------+
    # |COL1|COL2_imputed|
    # +----+------------+
    # |   a|         1.0|
    # |   b|         2.0|
    # |   c|         1.5|
    # +----+------------+

    imputer = Imputer(
        inputCols=['COL2'],
        outputCols=['COL2_imputed']
    )
    df1 = imputer.fit(df1).transform(df1)
    df1 = df1.drop('COL2')

    assert df1 == df2
{code}
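Until such an operator exists, a common workaround is to collect both DataFrames and compare the rows as multisets. A minimal sketch of that idea (the `rows_equal` helper is illustrative, not part of any API; it takes the lists returned by `DataFrame.collect()`):

```python
from collections import Counter

def rows_equal(rows1, rows2):
    """Return True if the two row collections hold the same rows,
    ignoring order but counting duplicates (multiset equality)."""
    return Counter(tuple(r) for r in rows1) == Counter(tuple(r) for r in rows2)

# In a test one would pass collected rows, e.g.
#   assert rows_equal(df1.collect(), df2.collect())
print(rows_equal([("a", 1.0), ("b", 2.0)], [("b", 2.0), ("a", 1.0)]))  # True
print(rows_equal([("a", 1.0), ("a", 1.0)], [("a", 1.0)]))              # False
```

This pulls all data to the driver, so it only suits small test fixtures, which is exactly the snapshot-testing case described above.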
Suggested change:
{code}
class DataFrame(object):
    ...
    def __eq__(self, other):
        """Returns ``True`` if DataFrame content is equal to other.

        >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
        >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)])
        >>> df1 == df2
        True
        """
        return self.union(other) \
            .subtract(self.intersect(other)) \
            .count() == 0
{code}
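One caveat worth noting: `subtract` and `intersect` follow SQL `EXCEPT DISTINCT` / `INTERSECT` semantics, so the suggested check compares DataFrames as sets and cannot distinguish two DataFrames that differ only in duplicate row counts. A plain-Python model of that set logic (the `set_eq` helper is illustrative only):

```python
def set_eq(rows1, rows2):
    """Model of the proposed __eq__ under set (DISTINCT) semantics:
    union minus intersection must be empty."""
    a, b = set(rows1), set(rows2)
    return len((a | b) - (a & b)) == 0

# Duplicates are invisible under set semantics:
print(set_eq([("a", 1), ("a", 1)], [("a", 1)]))  # True, although row counts differ
print(set_eq([("a", 1)], [("b", 2)]))            # False
```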
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)