Posted to issues@spark.apache.org by "Hugo (JIRA)" <ji...@apache.org> on 2019/06/26 13:10:00 UTC

[jira] [Created] (SPARK-28172) pyspark DataFrame equality operator

Hugo created SPARK-28172:
----------------------------

             Summary: pyspark DataFrame equality operator
                 Key: SPARK-28172
                 URL: https://issues.apache.org/jira/browse/SPARK-28172
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.4.3
            Reporter: Hugo


*Motivation:*
 Make it easier to test DataFrames for equality. Many approaches to applying TDD practices in data science check a function's output against an expected output, much like API snapshot testing. An equality operator, or something similar, would make this easier.

A basic example:
{code}
from pyspark.ml.feature import Imputer

def create_mock_missing_df():
   return spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", float('nan'))], ['COL1', 'COL2'])

def test_imputation():
   """Test mean value imputation"""

   df1 = create_mock_missing_df()

   # Load the expected snapshot
   pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect()
   df2 = spark.createDataFrame(pickled_snapshot)
   """
   >>> df2.show()
   +----+------------+
   |COL1|COL2_imputed|
   +----+------------+
   | a  | 1.0        |
   | b  | 2.0        |
   | c  | 1.5        |
   +----+------------+

   """ 
   imputer = Imputer(
      inputCols=['COL2'],
      outputCols=['COL2_imputed']
   )
   df1 = imputer.fit(df1).transform(df1)
   df1 = df1.drop('COL2')

   assert df1 == df2
{code}
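
Until such an operator exists, a similar check can be made on the driver by comparing the collected rows. The following is a minimal sketch, not part of PySpark: {{diff_rows}} is a hypothetical helper, and plain tuples stand in for the {{Row}} objects that {{df.collect()}} returns ({{Row}} is a tuple subclass). It ignores row order, counts duplicates, and reports which rows differ, which helps when a snapshot test fails.

```python
from collections import Counter


def diff_rows(actual, expected):
    """Return (missing, unexpected) rows, ignoring order but counting duplicates."""
    actual_counts = Counter(actual)
    expected_counts = Counter(expected)
    # Counter subtraction keeps only positive counts.
    missing = list((expected_counts - actual_counts).elements())
    unexpected = list((actual_counts - expected_counts).elements())
    return missing, unexpected


# Tuples stand in for df1.collect() and df2.collect() (assumption for illustration).
actual_rows = [("b", 2.0), ("a", 1.0), ("c", 1.5)]
expected_rows = [("a", 1.0), ("b", 2.0), ("c", 1.5)]

missing, unexpected = diff_rows(actual_rows, expected_rows)
assert not missing and not unexpected, f"missing={missing}, unexpected={unexpected}"
```

This only suits small test DataFrames, since every row is pulled to the driver. Rows that contain NaN would be reported as different, because NaN does not compare equal to itself in Python.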
 
 Suggested change:
{code}
class DataFrame(object):
   ...

   def __eq__(self, other):
      """Returns ``True`` if DataFrame content is equal to other.

      >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
      >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)])

      >>> df1 == df2
      True
      """
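      # Symmetric difference under set semantics: row order and
      # duplicate rows are ignored. union() replaces the deprecated unionAll().
      return self.union(other) \
         .subtract(self.intersect(other)) \
         .count() == 0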
{code}
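
Because {{subtract}} and {{intersect}} both operate on distinct rows, the expression above tests set equality: it ignores row order, and it also ignores how many times a row occurs. The sketch below models this in plain Python on lists of tuples, not in Spark. {{set_style_equal}} and {{multiset_equal}} are hypothetical names used only for illustration.

```python
from collections import Counter


def set_style_equal(rows_a, rows_b):
    """Model of union-all minus intersect, both evaluated on distinct rows."""
    union_all = list(rows_a) + list(rows_b)
    common = set(rows_a) & set(rows_b)
    leftover = {row for row in union_all if row not in common}
    return len(leftover) == 0


def multiset_equal(rows_a, rows_b):
    """Stricter check: every row must occur the same number of times."""
    return Counter(rows_a) == Counter(rows_b)


rows1 = [("a", 1), ("b", 2), ("c", 3)]
rows2 = [("b", 2), ("a", 1), ("c", 3)]
with_duplicate = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]

print(set_style_equal(rows1, rows2))           # same rows, different order
print(set_style_equal(rows1, with_duplicate))  # duplicate goes unnoticed
print(multiset_equal(rows1, with_duplicate))   # duplicate is detected
```

If duplicate counts should matter, one option is {{DataFrame.exceptAll}}, added in Spark 2.4, which keeps duplicates. Whether the schemas should also be compared is a separate choice that the proposal leaves open.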


