Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/10/19 08:08:35 UTC

[PR] [SPARK-45554][PYTHON] Introduce flexible parameter to assertSchemaEqual [spark]

itholic opened a new pull request, #43450:
URL: https://github.com/apache/spark/pull/43450

   ### What changes were proposed in this pull request?
   
   This PR proposes to add three new parameters to `assertSchemaEqual`: `ignoreNullable`, `ignoreColumnOrder` and `ignoreColumnName`, to give users more flexibility in schema testing.
   
   
   ### Why are the changes needed?
   
   To enhance the utility of `assertSchemaEqual` by accommodating various common schema comparison scenarios that users might encounter, without necessitating manual adjustments or workarounds.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. `assertSchemaEqual` now has the option to use the three new parameters:
   
   Parameter | Type | Comment
   -- | -- | --
   ignoreNullable | Boolean [optional] | Specifies whether a column’s nullable property is included when checking for schema equality.<br><br>When set to True (default), the nullable property of the columns being compared is not taken into account, and the columns are considered equal even if they have different nullable settings.<br><br>When set to False, columns are considered equal only if they have the same nullable setting.
   ignoreColumnOrder | Boolean [optional] | Specifies whether to compare columns in the order they appear in the DataFrames or by column name.<br><br>When set to False (default), columns are compared in the order they appear in the DataFrames.<br><br>When set to True, a column in the expected DataFrame is compared to the column with the same name in the actual DataFrame.<br><br>ignoreColumnOrder cannot be set to True if ignoreColumnName is also set to True.
   ignoreColumnName | Boolean [optional] | Specifies whether to fail the initial schema equality check if the column names in the two DataFrames are different.<br><br>When set to False (default), column names are checked and the function fails if they are different.<br><br>When set to True, the function succeeds even if column names are different. Column data types are compared for columns in the order they appear in the DataFrames.<br><br>ignoreColumnName cannot be set to True if ignoreColumnOrder is also set to True.
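
The rules in the table can be modeled in a few lines of plain Python. This is only a sketch of the described semantics, not the code in `python/pyspark/testing/utils.py`: the `Field` class, the `schemas_match` function, and the snake_case parameter names are hypothetical stand-ins invented for illustration, and data types are represented as plain strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema column (not a PySpark class)."""
    name: str
    dtype: str
    nullable: bool = True


def schemas_match(
    actual: List[Field],
    expected: List[Field],
    ignore_nullable: bool = True,
    ignore_column_order: bool = False,
    ignore_column_name: bool = False,
) -> bool:
    """Illustrative model of the comparison rules described in the table."""
    if ignore_column_order and ignore_column_name:
        # The table states that these two options cannot both be True.
        raise ValueError(
            "ignore_column_order and ignore_column_name are mutually exclusive"
        )
    if len(actual) != len(expected):
        return False
    if ignore_column_order:
        # Pair columns by name instead of by position.
        actual = sorted(actual, key=lambda f: f.name)
        expected = sorted(expected, key=lambda f: f.name)
    for a, e in zip(actual, expected):
        if not ignore_column_name and a.name != e.name:
            return False
        if a.dtype != e.dtype:
            return False
        if not ignore_nullable and a.nullable != e.nullable:
            return False
    return True


s1 = [Field("a", "int"), Field("b", "double")]
s2 = [Field("b", "double"), Field("a", "int")]
s3 = [Field("x", "int"), Field("y", "double", nullable=False)]

print(schemas_match(s1, s2))                            # False: order differs
print(schemas_match(s1, s2, ignore_column_order=True))  # True
print(schemas_match(s1, s3, ignore_column_name=True))   # True: nullable ignored
print(schemas_match(s1, s3, ignore_column_name=True, ignore_nullable=False))  # False
```

In this model, sorting by name is what makes the two ignore options incompatible: once names are ignored there is no key left to pair columns by, so only positional comparison makes sense.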
   
   
   
   
   ### How was this patch tested?
   
   Added usage examples to the doctests for each parameter.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #43450:
URL: https://github.com/apache/spark/pull/43450#discussion_r1373989122


##########
python/pyspark/testing/utils.py:
##########
@@ -311,11 +342,28 @@ def assertSchemaEqual(actual: StructType, expected: StructType):
 
     Examples
     --------
+    >>> from pyspark.pandas.utils import default_session
+    >>> spark = default_session()

Review Comment:
   Oh, it's a dummy line from my local testing. I'll remove it. Thanks for catching this!




Re: [PR] [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #43450:
URL: https://github.com/apache/spark/pull/43450#issuecomment-1784365051

   Merged to master.



Re: [PR] [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #43450:
URL: https://github.com/apache/spark/pull/43450#issuecomment-1771000613

   cc @HyukjinKwon @allanf-db 



Re: [PR] [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #43450:
URL: https://github.com/apache/spark/pull/43450#discussion_r1373989740


##########
python/pyspark/testing/utils.py:
##########
@@ -328,6 +376,26 @@ def assertSchemaEqual(actual: StructType, expected: StructType):
     ?                               ^^                               ^^^^^
     + StructType([StructField('id', StringType(), True), StructField('amount', LongType(), True)])
     ?                               ^^^^                              ++++ ^
+
+    Different schemas (ignoring column order)
+
+    >>> s1 = StructType(
+    ...     [StructField("a", IntegerType(), True), StructField("b", DoubleType(), True)]
+    ... )
+    >>> s2 = StructType(
+    ...     [StructField("b", DoubleType(), True), StructField("a", IntegerType(), True)]
+    ... )
+    >>> assertSchemaEqual(s1, s2, ignoreColumnOrder=True)
+
+    Different schemas (ignoring column names)

Review Comment:
   Applied the suggestions. Thanks!




Re: [PR] [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #43450: [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual`
URL: https://github.com/apache/spark/pull/43450



Re: [PR] [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #43450:
URL: https://github.com/apache/spark/pull/43450#issuecomment-1778236396

   CI has passed for this one as well. Gentle reminder for @HyukjinKwon, also cc @ueshin @zhengruifeng.



Re: [PR] [SPARK-45554][PYTHON] Introduce flexible parameter to `assertSchemaEqual` [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.

allisonwang-db commented on code in PR #43450:
URL: https://github.com/apache/spark/pull/43450#discussion_r1373960309


##########
python/pyspark/testing/utils.py:
##########
@@ -328,6 +376,26 @@ def assertSchemaEqual(actual: StructType, expected: StructType):
     ?                               ^^                               ^^^^^
     + StructType([StructField('id', StringType(), True), StructField('amount', LongType(), True)])
     ?                               ^^^^                              ++++ ^
+
+    Different schemas (ignoring column order)
+
+    >>> s1 = StructType(
+    ...     [StructField("a", IntegerType(), True), StructField("b", DoubleType(), True)]
+    ... )
+    >>> s2 = StructType(
+    ...     [StructField("b", DoubleType(), True), StructField("a", IntegerType(), True)]
+    ... )
+    >>> assertSchemaEqual(s1, s2, ignoreColumnOrder=True)
+
+    Different schemas (ignoring column names)

Review Comment:
   ```suggestion
       Compare two schemas ignoring the column names
   ```



##########
python/pyspark/testing/utils.py:
##########
@@ -311,11 +342,28 @@ def assertSchemaEqual(actual: StructType, expected: StructType):
 
     Examples
     --------
+    >>> from pyspark.pandas.utils import default_session
+    >>> spark = default_session()

Review Comment:
   Why do we need this?



##########
python/pyspark/testing/utils.py:
##########
@@ -328,6 +376,26 @@ def assertSchemaEqual(actual: StructType, expected: StructType):
     ?                               ^^                               ^^^^^
     + StructType([StructField('id', StringType(), True), StructField('amount', LongType(), True)])
     ?                               ^^^^                              ++++ ^
+
+    Different schemas (ignoring column order)

Review Comment:
   ```suggestion
       Compare two schemas ignoring the column order
   ```


