Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/06/27 02:25:00 UTC

[jira] [Resolved] (SPARK-28172) pyspark DataFrame equality operator

     [ https://issues.apache.org/jira/browse/SPARK-28172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-28172.
----------------------------------
    Resolution: Duplicate

> pyspark DataFrame equality operator
> -----------------------------------
>
>                 Key: SPARK-28172
>                 URL: https://issues.apache.org/jira/browse/SPARK-28172
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: Hugo
>            Priority: Minor
>
> *Motivation:*
> Make it easier to test DataFrames for equality. Many approaches to applying TDD practices in data science require checking a function's output against an expected output (similar to API snapshot testing). An equality operator, or something similar, would make this easier.
> A basic example:
> {code}
> from pyspark.ml.feature import Imputer
> def create_mock_missing_df():
>    return spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", float('nan'))], ['COL1', 'COL2'])
> def test_imputation():
>    """Test mean value imputation"""
>    df1 = create_mock_missing_df()
>    # Load the expected snapshot
>    pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect()
>    df2 = spark.createDataFrame(pickled_snapshot)
>    """
>    >>> df2.show()
>    +----+------------+
>    |COL1|COL2_imputed|
>    +----+------------+
>    | a  | 1.0        |
>    | b  | 2.0        |
>    | c  | 1.5        |
>    +----+------------+
>    """ 
>    imputer = Imputer(
>       inputCols=['COL2'],
>       outputCols=['COL2_imputed']
>    )
>    df1 = imputer.fit(df1).transform(df1)
>    df1 = df1.drop('COL2')
>    assert df1 == df2
> {code}
>  
> *Suggested change:*
> {code}
> class DataFrame(object):
>    ...
>    def __eq__(self, other):
>       """Returns ``True`` if DataFrame content is equal to other.
>       >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
>       >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)])
>       >>> df1 == df2
>       True
>       """
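>       # union() replaces unionAll(), which is deprecated since Spark 2.0.
>       # subtract() and intersect() have set semantics, so duplicate rows are ignored.
>       return self.union(other) \
>          .subtract(self.intersect(other)) \
>          .count() == 0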
> {code}
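
One caveat with the suggested check: `subtract` and `intersect` both have set (DISTINCT) semantics, so two DataFrames that differ only in how often a row repeats would still compare as equal. For small test fixtures, a possible alternative is to collect both sides and compare them as multisets. The sketch below is only an illustration under stated assumptions: `rows_equal` and `_NAN` are hypothetical names, not part of PySpark; every cell value is assumed to be hashable (no array or map columns); and NaN cells are treated as equal to each other, which plain Python comparison would not do.

```python
import math
from collections import Counter

# Hypothetical sentinel (not a PySpark name): stands in for NaN so that
# two NaN cells are counted as the same value.
_NAN = object()


def _normalize(value):
    """Replace a float NaN with the sentinel; return other values unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return _NAN
    return value


def rows_equal(rows_a, rows_b):
    """Return True if both iterables hold the same rows with the same
    multiplicity, regardless of order.

    Rows can be any tuple-like objects, for example the result of
    DataFrame.collect(). All cell values are assumed to be hashable.
    """
    count_a = Counter(tuple(_normalize(v) for v in row) for row in rows_a)
    count_b = Counter(tuple(_normalize(v) for v in row) for row in rows_b)
    return count_a == count_b
```

In a test this could be used as `assert df1.columns == df2.columns and rows_equal(df1.collect(), df2.collect())`. Because it collects both DataFrames to the driver, it only suits data that comfortably fits in driver memory.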


