You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "allisonwang-db (via GitHub)" <gi...@apache.org> on 2023/07/14 16:34:24 UTC

[GitHub] [spark] allisonwang-db commented on a diff in pull request #41927: [SPARK-44216] [PYTHON] Make assertSchemaEqual API public

allisonwang-db commented on code in PR #41927:
URL: https://github.com/apache/spark/pull/41927#discussion_r1263911806


##########
python/pyspark/testing/utils.py:
##########
@@ -30,6 +30,7 @@
     Tuple,
 )
 from itertools import zip_longest
+import difflib

Review Comment:
   super nit: group it with the other `import ...` statement?



##########
python/pyspark/sql/tests/test_utils.py:
##########
@@ -921,22 +931,186 @@ def test_diff_schema_lens(self):
             schema=["id", "amount", "letter"],
         )
 
+        if is_remote():

Review Comment:
   Instead of checking `is_remote` here, can we split it into two test cases extending from this mixin?
   
   ```
   class BaseUtilsTestsMixin:
       def _get_schema_tree_string(self):
           ...
   
   class UtilsTests(BaseUtilsTestsMixin, ReusedSQLTestCase):
       ...
   
   # Under the connect test directory: python/pyspark/sql/tests/connect/test_parity_utils.py
   class UtilsParityTests(BaseUDFTestsMixin, ReusedConnectTestCase):
       def _get_schema_tree_string(self):
           ...
   ```
    And we can add helper functions to get the schema string and override it in connect test cases.



##########
python/pyspark/testing/utils.py:
##########
@@ -222,6 +223,114 @@ def check_error(
         )
 
 
+def assertSchemaEqual(actual: StructType, expected: StructType):
+    """
+    A util function to assert equality between DataFrame schemas `actual` and `expected`.

Review Comment:
   Maybe we should also update the param name for `assertDataFrameEqual`.



##########
python/pyspark/testing/utils.py:
##########
@@ -222,6 +223,114 @@ def check_error(
         )
 
 
+def assertSchemaEqual(actual: StructType, expected: StructType):
+    """
+    A util function to assert equality between DataFrame schemas `actual` and `expected`.
+
+    .. versionadded:: 3.5.0
+
+    Parameters
+    ----------
+    actual : StructType
+        The DataFrame schema that is being compared or tested.
+    expected : StructType
+        The expected schema, for comparison with the actual schema.
+
+    Examples
+    --------
+    >>> from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, DoubleType
+    >>> s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> s2 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> assertSchemaEqual(s1, s2)
+
+    Pass, schemas are identical
+
+    >>> s1 = StructType([StructField("names", StructType(
+    ...                                         [StructField("age", DoubleType(), True)]), True)])
+    >>> s2 = StructType([StructField("first name", StructType([StructField(
+    ...                                                      "age", IntegerType(), True)]), True)])
+    >>> assertSchemaEqual(s1, s2)  # fail  # doctest: +IGNORE_EXCEPTION_DETAIL
+    Traceback (most recent call last):
+    ...
+    PySparkAssertionError: [DIFFERENT_SCHEMA] Schemas do not match.
+    The diff below overlays `actual` and `expected` schema tree strings.
+    - indicates a line that should be removed from `actual` to match `expected`.
+    + indicates a line that should be added to `actual` to match `expected`.
+      root
+    -  |-- names: struct (nullable = true)
+    ?          -
+
+    +  |-- first name: struct (nullable = true)
+    ?     ++++++
+
+    -  |    |-- age: double (nullable = true)
+    ?                ^^^^^
+
+    +  |    |-- age: integer (nullable = true)
+    ?                ^^^ +++

Review Comment:
   Hmm why do we have both `^^^` and `+++`?



##########
python/pyspark/testing/utils.py:
##########
@@ -222,6 +223,114 @@ def check_error(
         )
 
 
+def assertSchemaEqual(actual: StructType, expected: StructType):
+    """
+    A util function to assert equality between DataFrame schemas `actual` and `expected`.
+
+    .. versionadded:: 3.5.0
+
+    Parameters
+    ----------
+    actual : StructType
+        The DataFrame schema that is being compared or tested.
+    expected : StructType

Review Comment:
   We can also make it support a DDL string in the future :) 



##########
python/pyspark/testing/utils.py:
##########
@@ -222,6 +223,114 @@ def check_error(
         )
 
 
+def assertSchemaEqual(actual: StructType, expected: StructType):
+    """
+    A util function to assert equality between DataFrame schemas `actual` and `expected`.
+
+    .. versionadded:: 3.5.0
+
+    Parameters
+    ----------
+    actual : StructType
+        The DataFrame schema that is being compared or tested.
+    expected : StructType
+        The expected schema, for comparison with the actual schema.
+
+    Examples
+    --------
+    >>> from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, DoubleType
+    >>> s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> s2 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> assertSchemaEqual(s1, s2)
+
+    Pass, schemas are identical
+
+    >>> s1 = StructType([StructField("names", StructType(
+    ...                                         [StructField("age", DoubleType(), True)]), True)])
+    >>> s2 = StructType([StructField("first name", StructType([StructField(
+    ...                                                      "age", IntegerType(), True)]), True)])
+    >>> assertSchemaEqual(s1, s2)  # fail  # doctest: +IGNORE_EXCEPTION_DETAIL
+    Traceback (most recent call last):
+    ...
+    PySparkAssertionError: [DIFFERENT_SCHEMA] Schemas do not match.
+    The diff below overlays `actual` and `expected` schema tree strings.
+    - indicates a line that should be removed from `actual` to match `expected`.
+    + indicates a line that should be added to `actual` to match `expected`.
+      root
+    -  |-- names: struct (nullable = true)
+    ?          -
+
+    +  |-- first name: struct (nullable = true)
+    ?     ++++++
+
+    -  |    |-- age: double (nullable = true)
+    ?                ^^^^^
+
+    +  |    |-- age: integer (nullable = true)
+    ?                ^^^ +++
+    """
+    if not isinstance(actual, StructType):
+        raise PySparkAssertionError(
+            error_class="UNSUPPORTED_DATA_TYPE",

Review Comment:
   Let's not use the error class UNSUPPORTED_DATA_TYPE here. We can just throw a pyspark type error here and tell users how to make it right (expect a structtype, but got... )



##########
python/pyspark/testing/utils.py:
##########
@@ -222,6 +223,114 @@ def check_error(
         )
 
 
+def assertSchemaEqual(actual: StructType, expected: StructType):
+    """
+    A util function to assert equality between DataFrame schemas `actual` and `expected`.
+
+    .. versionadded:: 3.5.0
+
+    Parameters
+    ----------
+    actual : StructType
+        The DataFrame schema that is being compared or tested.
+    expected : StructType
+        The expected schema, for comparison with the actual schema.
+
+    Examples
+    --------
+    >>> from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, DoubleType
+    >>> s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> s2 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> assertSchemaEqual(s1, s2)
+
+    Pass, schemas are identical
+
+    >>> s1 = StructType([StructField("names", StructType(
+    ...                                         [StructField("age", DoubleType(), True)]), True)])
+    >>> s2 = StructType([StructField("first name", StructType([StructField(
+    ...                                                      "age", IntegerType(), True)]), True)])
+    >>> assertSchemaEqual(s1, s2)  # fail  # doctest: +IGNORE_EXCEPTION_DETAIL
+    Traceback (most recent call last):
+    ...
+    PySparkAssertionError: [DIFFERENT_SCHEMA] Schemas do not match.
+    The diff below overlays `actual` and `expected` schema tree strings.
+    - indicates a line that should be removed from `actual` to match `expected`.
+    + indicates a line that should be added to `actual` to match `expected`.
+      root
+    -  |-- names: struct (nullable = true)
+    ?          -
+
+    +  |-- first name: struct (nullable = true)
+    ?     ++++++
+
+    -  |    |-- age: double (nullable = true)
+    ?                ^^^^^
+
+    +  |    |-- age: integer (nullable = true)
+    ?                ^^^ +++
+    """
+    if not isinstance(actual, StructType):
+        raise PySparkAssertionError(
+            error_class="UNSUPPORTED_DATA_TYPE",
+            message_parameters={"data_type": type(actual)},
+        )
+    if not isinstance(expected, StructType):
+        raise PySparkAssertionError(
+            error_class="UNSUPPORTED_DATA_TYPE",
+            message_parameters={"data_type": type(expected)},
+        )
+
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.appName("assertSchemaEqual").getOrCreate()
+
+    def compare_schemas_ignore_nullable(s1, s2):
+        if len(s1) != len(s2):
+            return False
+        zipped = zip_longest(s1, s2)
+        for sf1, sf2 in zipped:
+            if not compare_structfields_ignore_nullable(sf1, sf2):
+                return False
+        return True
+
+    def compare_structfields_ignore_nullable(actualSF, expectedSF):
+        if actualSF is None and expectedSF is None:
+            return True
+        elif actualSF is None or expectedSF is None:
+            return False
+        if actualSF.name != expectedSF.name:
+            return False
+        else:
+            return compare_datatypes_ignore_nullable(actualSF.dataType, expectedSF.dataType)
+
+    def compare_datatypes_ignore_nullable(dt1, dt2):

Review Comment:
   nit: can we add type hints for all these functions?  



##########
python/pyspark/testing/utils.py:
##########
@@ -222,6 +223,114 @@ def check_error(
         )
 
 
+def assertSchemaEqual(actual: StructType, expected: StructType):
+    """
+    A util function to assert equality between DataFrame schemas `actual` and `expected`.
+
+    .. versionadded:: 3.5.0
+
+    Parameters
+    ----------
+    actual : StructType
+        The DataFrame schema that is being compared or tested.
+    expected : StructType
+        The expected schema, for comparison with the actual schema.
+
+    Examples
+    --------
+    >>> from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, DoubleType
+    >>> s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> s2 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> assertSchemaEqual(s1, s2)
+
+    Pass, schemas are identical
+
+    >>> s1 = StructType([StructField("names", StructType(
+    ...                                         [StructField("age", DoubleType(), True)]), True)])
+    >>> s2 = StructType([StructField("first name", StructType([StructField(
+    ...                                                      "age", IntegerType(), True)]), True)])

Review Comment:
   nit: indent



##########
python/pyspark/testing/utils.py:
##########
@@ -222,6 +223,114 @@ def check_error(
         )
 
 
+def assertSchemaEqual(actual: StructType, expected: StructType):
+    """
+    A util function to assert equality between DataFrame schemas `actual` and `expected`.
+
+    .. versionadded:: 3.5.0
+
+    Parameters
+    ----------
+    actual : StructType
+        The DataFrame schema that is being compared or tested.
+    expected : StructType
+        The expected schema, for comparison with the actual schema.
+
+    Examples
+    --------
+    >>> from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, DoubleType
+    >>> s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> s2 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
+    >>> assertSchemaEqual(s1, s2)
+
+    Pass, schemas are identical
+
+    >>> s1 = StructType([StructField("names", StructType(
+    ...                                         [StructField("age", DoubleType(), True)]), True)])
+    >>> s2 = StructType([StructField("first name", StructType([StructField(
+    ...                                                      "age", IntegerType(), True)]), True)])
+    >>> assertSchemaEqual(s1, s2)  # fail  # doctest: +IGNORE_EXCEPTION_DETAIL
+    Traceback (most recent call last):
+    ...
+    PySparkAssertionError: [DIFFERENT_SCHEMA] Schemas do not match.
+    The diff below overlays `actual` and `expected` schema tree strings.
+    - indicates a line that should be removed from `actual` to match `expected`.
+    + indicates a line that should be added to `actual` to match `expected`.
+      root
+    -  |-- names: struct (nullable = true)
+    ?          -
+
+    +  |-- first name: struct (nullable = true)
+    ?     ++++++
+
+    -  |    |-- age: double (nullable = true)
+    ?                ^^^^^
+
+    +  |    |-- age: integer (nullable = true)
+    ?                ^^^ +++
+    """
+    if not isinstance(actual, StructType):
+        raise PySparkAssertionError(
+            error_class="UNSUPPORTED_DATA_TYPE",
+            message_parameters={"data_type": type(actual)},
+        )
+    if not isinstance(expected, StructType):
+        raise PySparkAssertionError(
+            error_class="UNSUPPORTED_DATA_TYPE",
+            message_parameters={"data_type": type(expected)},
+        )
+
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.appName("assertSchemaEqual").getOrCreate()

Review Comment:
   Hmm, is it necessary to instantiate a Spark session here? Doing so might make this utility function quite heavy. If possible, we should directly check the StructType of the schema, rather than using a Spark session.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org