You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "asl3 (via GitHub)" <gi...@apache.org> on 2023/07/10 15:43:51 UTC

[GitHub] [spark] asl3 opened a new pull request, #41924: Add support for List[Row] data type for expected

asl3 opened a new pull request, #41924:
URL: https://github.com/apache/spark/pull/41924

   ### What changes were proposed in this pull request?
   This PR adds supported for List[Row] type for the `expected` argument in `assertDataFrameEqual`.
   
   ### Why are the changes needed?
   The change improves flexibility of the `assertDataFrameEqual` function by allowing for comparison between dataframe and List[Row] types.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the PR modifies the user-facing API `assertDataFrameEqual`.
   
   
   ### How was this patch tested?
   Added tests to `runtime/python/pyspark/sql/tests/test_utils.py` and `runtime/python/pyspark/sql/tests/connect/test_utils.py`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected
URL: https://github.com/apache/spark/pull/41924


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allisonwang-db commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1260237411


##########
python/pyspark/testing/utils.py:
##########
@@ -382,18 +388,23 @@ def assert_rows_equal(rows1: List[Row], rows2: List[Row]):
         try:
             # rename duplicate columns for sorting
             renamed_df = df.toDF(*[f"_{i}" for i in range(len(df.columns))])
-            renamed_expected = expected.toDF(*[f"_{i}" for i in range(len(expected.columns))])
-
             df = renamed_df.sort(renamed_df.columns).toDF(*df.columns)
-            expected = renamed_expected.sort(renamed_expected.columns).toDF(*expected.columns)
+            if isinstance(expected, List):
+                expected.sort()

Review Comment:
   There is a flag  `check_row_orders`. If this flag is False, we will sort both the input data frame and the expected list of Rows. @ueshin do you think it makes more sense to set the default value to be `True` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allisonwang-db commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1260237411


##########
python/pyspark/testing/utils.py:
##########
@@ -382,18 +388,23 @@ def assert_rows_equal(rows1: List[Row], rows2: List[Row]):
         try:
             # rename duplicate columns for sorting
             renamed_df = df.toDF(*[f"_{i}" for i in range(len(df.columns))])
-            renamed_expected = expected.toDF(*[f"_{i}" for i in range(len(expected.columns))])
-
             df = renamed_df.sort(renamed_df.columns).toDF(*df.columns)
-            expected = renamed_expected.sort(renamed_expected.columns).toDF(*expected.columns)
+            if isinstance(expected, List):
+                expected.sort()

Review Comment:
   There is a flag  `check_row_orders`. If this flag is False, we will sort both the input data frame and the expected list of Rows. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1262400525


##########
python/pyspark/testing/utils.py:
##########
@@ -235,7 +237,7 @@ def assertDataFrameEqual(df: DataFrame, expected: DataFrame, check_row_order: bo
     df : DataFrame
         The DataFrame that is being compared or tested.
 
-    expected : DataFrame
+    expected : DataFrame or List[Row]

Review Comment:
   ```suggestion
       expected : DataFrame or list of rows
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allisonwang-db commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1260237411


##########
python/pyspark/testing/utils.py:
##########
@@ -382,18 +388,23 @@ def assert_rows_equal(rows1: List[Row], rows2: List[Row]):
         try:
             # rename duplicate columns for sorting
             renamed_df = df.toDF(*[f"_{i}" for i in range(len(df.columns))])
-            renamed_expected = expected.toDF(*[f"_{i}" for i in range(len(expected.columns))])
-
             df = renamed_df.sort(renamed_df.columns).toDF(*df.columns)
-            expected = renamed_expected.sort(renamed_expected.columns).toDF(*expected.columns)
+            if isinstance(expected, List):
+                expected.sort()

Review Comment:
   There is a flag  `check_row_orders`. If this flag is False, we will sort both the input data frame and the expected list of Rows. @ueshin do you think it makes more sense to set the default value to be `True` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1262401099


##########
python/pyspark/testing/utils.py:
##########
@@ -221,10 +221,12 @@ def check_error(
         )
 
 
-def assertDataFrameEqual(df: DataFrame, expected: DataFrame, check_row_order: bool = False):
+def assertDataFrameEqual(
+    df: DataFrame, expected: Union[DataFrame, List[Row]], check_row_order: bool = False

Review Comment:
   `check_row_order` -> `checkRowOrder` or `checkOrder`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1260128512


##########
python/pyspark/testing/utils.py:
##########
@@ -382,18 +388,23 @@ def assert_rows_equal(rows1: List[Row], rows2: List[Row]):
         try:
             # rename duplicate columns for sorting
             renamed_df = df.toDF(*[f"_{i}" for i in range(len(df.columns))])
-            renamed_expected = expected.toDF(*[f"_{i}" for i in range(len(expected.columns))])
-
             df = renamed_df.sort(renamed_df.columns).toDF(*df.columns)
-            expected = renamed_expected.sort(renamed_expected.columns).toDF(*expected.columns)
+            if isinstance(expected, List):
+                expected.sort()

Review Comment:
   ```py
   expected = sorted(expected)
   ```
   
   to avoid updating the input.



##########
python/pyspark/testing/utils.py:
##########
@@ -382,18 +388,23 @@ def assert_rows_equal(rows1: List[Row], rows2: List[Row]):
         try:
             # rename duplicate columns for sorting
             renamed_df = df.toDF(*[f"_{i}" for i in range(len(df.columns))])
-            renamed_expected = expected.toDF(*[f"_{i}" for i in range(len(expected.columns))])
-
             df = renamed_df.sort(renamed_df.columns).toDF(*df.columns)
-            expected = renamed_expected.sort(renamed_expected.columns).toDF(*expected.columns)
+            if isinstance(expected, List):
+                expected.sort()

Review Comment:
   Btw, I'm wondering Python's `sort` or `sorted` works as same as `df.sort()`. 
   We might want to ask users to make it already sorted when they specify a list here.
   cc @allisonwang-db 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allisonwang-db commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1260244209


##########
python/pyspark/testing/utils.py:
##########
@@ -382,18 +388,23 @@ def assert_rows_equal(rows1: List[Row], rows2: List[Row]):
         try:
             # rename duplicate columns for sorting
             renamed_df = df.toDF(*[f"_{i}" for i in range(len(df.columns))])
-            renamed_expected = expected.toDF(*[f"_{i}" for i in range(len(expected.columns))])
-
             df = renamed_df.sort(renamed_df.columns).toDF(*df.columns)
-            expected = renamed_expected.sort(renamed_expected.columns).toDF(*expected.columns)
+            if isinstance(expected, List):
+                expected.sort()

Review Comment:
   We can convert the input data frame to a list and sort both lists of rows.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41924:
URL: https://github.com/apache/spark/pull/41924#issuecomment-1635051413

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1261840941


##########
python/pyspark/sql/tests/test_utils.py:
##########
@@ -729,46 +729,6 @@ def test_diff_schema_lens(self):
             message_parameters={"df_schema": df1.schema, "expected_schema": df2.schema},
         )
 
-    def test_assert_equal_maptype(self):

Review Comment:
   Shall we keep and update this test to see whether the change on `sort` works for map type or not? If not, we should still show some error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41924: [SPARK-44364] [PYTHON] Add support for List[Row] data type for expected

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41924:
URL: https://github.com/apache/spark/pull/41924#discussion_r1262399738


##########
python/pyspark/testing/utils.py:
##########
@@ -291,7 +293,11 @@ def assertDataFrameEqual(df: DataFrame, expected: DataFrame, check_row_order: bo
                 error_class="UNSUPPORTED_DATA_TYPE",
                 message_parameters={"data_type": type(df)},
             )
-        elif not isinstance(expected, DataFrame) and not isinstance(expected, ConnectDataFrame):
+        elif (
+            not isinstance(expected, DataFrame)
+            and not isinstance(expected, ConnectDataFrame)
+            and not isinstance(expected, List)

Review Comment:
   ```suggestion
               and not isinstance(expected, list)
   ```
   
   List is for type hint.



##########
python/pyspark/testing/utils.py:
##########
@@ -291,7 +293,11 @@ def assertDataFrameEqual(df: DataFrame, expected: DataFrame, check_row_order: bo
                 error_class="UNSUPPORTED_DATA_TYPE",
                 message_parameters={"data_type": type(df)},
             )
-        elif not isinstance(expected, DataFrame) and not isinstance(expected, ConnectDataFrame):
+        elif (
+            not isinstance(expected, DataFrame)
+            and not isinstance(expected, ConnectDataFrame)
+            and not isinstance(expected, List)

Review Comment:
   ```suggestion
               and not isinstance(expected, list)
   ```
   
   List is for type hint.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org