Posted to issues@spark.apache.org by "Oscar Delicaat (Jira)" <ji...@apache.org> on 2022/12/02 15:08:00 UTC

[jira] [Created] (SPARK-41370) Add data frame equality check for testing purposes.

Oscar Delicaat created SPARK-41370:
--------------------------------------

             Summary: Add data frame equality check for testing purposes.
                 Key: SPARK-41370
                 URL: https://issues.apache.org/jira/browse/SPARK-41370
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Oscar Delicaat


We would like to have the functionality suggested in https://issues.apache.org/jira/browse/SPARK-28172, which was closed by an unrelated story. The comment on that story

> Wouldn't this require to execute both DataFrames and collect the data into driver side? When the datasets are large, it's very easy for users to shoot them in the foot. I won't do that without an explicit plan and design doc for all other operators.

does not apply to our use case, since the check is intended for unit testing, where the DataFrames being compared are small.
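For illustration, something along these lines would cover our use case (a minimal sketch only; `assert_df_equal` is a hypothetical name, not an existing PySpark API):

```python
# Minimal sketch of the requested assertion (assert_df_equal is a
# hypothetical name). Schemas are compared directly, and exceptAll is
# used in both directions to report which rows differ. This collects
# rows to the driver, which is fine for small test DataFrames.
def assert_df_equal(actual, expected):
    assert actual.schema == expected.schema, (
        f"Schema mismatch: {actual.schema} != {expected.schema}"
    )
    missing = expected.exceptAll(actual).collect()  # in expected, not actual
    extra = actual.exceptAll(expected).collect()    # in actual, not expected
    assert not missing and not extra, (
        f"Rows missing from actual: {missing}; unexpected rows: {extra}"
    )
```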

We are currently using pandas' `assert_frame_equal` (https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/_testing/asserters.py#L1135-L1358), which works very well. However, we are running into issues because pandas supports only a subset of the timestamps supported by Spark [https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-timestamp-limits]. This is the first time this became a blocker for us, so we would like Spark to provide similar functionality: validate equality and get feedback on what differs. As far as I could see, there is nothing in PySpark that currently does this.
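Concretely, our current workaround looks roughly like this (an illustrative sketch; the helper name is ours). It breaks as soon as any timestamp falls outside the range pandas can represent with nanosecond precision (roughly years 1677-2262):

```python
from pandas.testing import assert_frame_equal

# Illustrative sketch of our pandas-based workaround: sort by all columns
# so row order does not matter, convert to pandas, and delegate to
# assert_frame_equal. Timestamps outside pandas' supported range break
# the toPandas() conversion, which is the blocker described above.
def assert_spark_df_equal(actual, expected):
    cols = actual.columns
    assert_frame_equal(
        actual.orderBy(cols).toPandas().reset_index(drop=True),
        expected.orderBy(cols).toPandas().reset_index(drop=True),
    )
```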

Please let me know any feedback you have, thanks!


