Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/10/18 11:29:22 UTC

[PR] [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` [spark]

itholic opened a new pull request, #43433:
URL: https://github.com/apache/spark/pull/43433

   
   ### What changes were proposed in this pull request?
   
   This PR proposes adding five new parameters to `assertDataFrameEqual`: `ignoreColumnOrder`, `ignoreColumnName`, `ignoreColumnType`, `maxErrors`, and `showOnlyDiff`, to give users more flexibility in DataFrame testing.
   
   
   ### Why are the changes needed?
   
   To enhance the utility of `assertDataFrameEqual` by accommodating various common DataFrame comparison scenarios that users might encounter, without necessitating manual adjustments or workarounds.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. `assertDataFrameEqual` now accepts five new optional parameters:
   
   Parameter | Type | Comment
   -- | -- | --
   ignoreColumnOrder | Boolean [optional] | Specifies whether to compare columns in the order they appear in the DataFrames or by column name. When set to False (default), columns are compared in the order they appear in the DataFrames. When set to True, a column in the expected DataFrame is compared to the column with the same name in the actual DataFrame. ignoreColumnOrder cannot be set to True if ignoreColumnName is also set to True.
   ignoreColumnName | Boolean [optional] | Specifies whether to fail the initial schema equality check if the column names in the two DataFrames are different. When set to False (default), column names are checked and the function fails if they are different. When set to True, the function succeeds even if column names are different; column data types are compared for columns in the order they appear in the DataFrames. ignoreColumnName cannot be set to True if ignoreColumnOrder is also set to True.
   ignoreColumnType | Boolean [optional] | Specifies whether to ignore the data type of the columns when comparing. When set to False (default), column data types are checked and the function fails if they are different. When set to True, the schema equality check succeeds even if column data types are different, and the function attempts to compare rows.
   maxErrors | Integer [optional] | The maximum number of row comparison failures to encounter before returning. Once this many row comparisons have failed, the function returns regardless of how many rows have been compared. Set to None by default, which means all rows are compared regardless of the number of failures.
   showOnlyDiff | Boolean [optional] | If set to True, the error message includes only the rows that are different. If set to False (default), the error message includes all rows (when at least one row is different).
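
The schema-related options in the table above can be pictured with a small pure-Python sketch. This is only an illustration of the described semantics, not the code in `python/pyspark/testing/utils.py`: the function `schemas_match`, its snake_case parameters, and the representation of a schema as a list of `(name, type)` pairs are all assumptions made for this example, and column names are assumed to be unique.

```python
from typing import List, Tuple

# Assumed representation for this sketch: (column name, data type name).
Column = Tuple[str, str]


def schemas_match(
    actual: List[Column],
    expected: List[Column],
    ignore_column_order: bool = False,
    ignore_column_name: bool = False,
    ignore_column_type: bool = False,
) -> bool:
    """Hypothetical stand-in for the schema check; not PySpark's code."""
    if ignore_column_order and ignore_column_name:
        # The table states that these two options cannot both be True.
        raise ValueError(
            "ignore_column_order and ignore_column_name cannot both be True"
        )
    if len(actual) != len(expected):
        return False
    if ignore_column_order:
        # Pair columns by name instead of by position.
        actual = sorted(actual, key=lambda col: col[0])
        expected = sorted(expected, key=lambda col: col[0])
    for (a_name, a_type), (e_name, e_type) in zip(actual, expected):
        if not ignore_column_name and a_name != e_name:
            return False
        if not ignore_column_type and a_type != e_type:
            return False
    return True


expected = [("id", "long"), ("name", "string")]
reordered = [("name", "string"), ("id", "long")]
print(schemas_match(reordered, expected))
print(schemas_match(reordered, expected, ignore_column_order=True))
```

With the defaults, the reordered schema fails the positional check; with `ignore_column_order=True` the columns are paired by name and the check passes. In PySpark itself, the corresponding options would be passed as keyword arguments to `assertDataFrameEqual`.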
   
   
   
   
   ### How was this patch tested?
   
   Added usage examples into doctest for each parameter.
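
For the row-level options `maxErrors` and `showOnlyDiff`, the behaviour described in the parameter table can be sketched in plain Python as follows. This is a hedged illustration rather than the doctests added by the PR: `diff_rows` is a hypothetical helper, rows are modelled as tuples, and both inputs are assumed to have the same number of rows and to be in comparison order already.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[object, ...]


def diff_rows(
    actual: Sequence[Row],
    expected: Sequence[Row],
    max_errors: Optional[int] = None,
    show_only_diff: bool = False,
) -> List[Tuple[int, Row, Row]]:
    """Collect (index, actual_row, expected_row) entries for an error report.

    Hypothetical helper for illustration only; not PySpark's implementation.
    """
    report: List[Tuple[int, Row, Row]] = []
    errors = 0
    for i, (a, e) in enumerate(zip(actual, expected)):
        mismatch = a != e
        if mismatch:
            errors += 1
        # Matching rows are reported only when show_only_diff is False.
        if mismatch or not show_only_diff:
            report.append((i, a, e))
        # Stop as soon as the allowed number of failures is reached.
        if max_errors is not None and errors >= max_errors:
            break
    # When nothing differs there is no error to report.
    return report if errors else []


actual = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
expected = [(1, "a"), (2, "x"), (3, "c"), (4, "y")]
print(diff_rows(actual, expected, show_only_diff=True))
print(diff_rows(actual, expected, max_errors=1, show_only_diff=True))
```

The first call reports both differing rows (indexes 1 and 3); the second stops after the first failure and reports only the row at index 1.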
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #43433:
URL: https://github.com/apache/spark/pull/43433#issuecomment-1778234604

   Gentle reminder for @HyukjinKwon as CI passed. Also cc @ueshin @zhengruifeng 




Re: [PR] [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #43433:
URL: https://github.com/apache/spark/pull/43433#discussion_r1373961866


##########
python/pyspark/testing/utils.py:
##########
@@ -396,6 +398,12 @@ def assertDataFrameEqual(
     checkRowOrder: bool = False,
     rtol: float = 1e-5,
     atol: float = 1e-8,
+    ignoreNullable: bool = True,
+    ignoreColumnOrder: bool = False,
+    ignoreColumnName: bool = False,
+    ignoreColumnType: bool = False,

Review Comment:
   Have we considered `ignoreSchema`?





Re: [PR] [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43433:
URL: https://github.com/apache/spark/pull/43433#issuecomment-1784365021

   Merged to master.




Re: [PR] [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43433: [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual`
URL: https://github.com/apache/spark/pull/43433




Re: [PR] [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #43433:
URL: https://github.com/apache/spark/pull/43433#issuecomment-1768253256

   cc @HyukjinKwon @allanf-db FYI




Re: [PR] [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #43433:
URL: https://github.com/apache/spark/pull/43433#discussion_r1373982314


##########
python/pyspark/testing/utils.py:
##########
@@ -396,6 +398,12 @@ def assertDataFrameEqual(
     checkRowOrder: bool = False,
     rtol: float = 1e-5,
     atol: float = 1e-8,
+    ignoreNullable: bool = True,
+    ignoreColumnOrder: bool = False,
+    ignoreColumnName: bool = False,
+    ignoreColumnType: bool = False,

Review Comment:
   It sounds worth having for some scenarios. Let's discuss it in a separate thread!


