You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/05/11 02:36:45 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #41127: [SPARK-43442][PYTHON][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

zhengruifeng opened a new pull request, #41127:
URL: https://github.com/apache/spark/pull/41127

   ### What changes were proposed in this pull request?
   Split test module `pyspark_pandas_connect`.
   
   Add a new module `pyspark_pandas_slow_connect` which should keep in line with `pyspark_pandas_slow`
   
   ### Why are the changes needed?
   `pyspark_pandas_connect` may take 3~4 hours
   
   ### Does this PR introduce _any_ user-facing change?
   No, test-only
   
   ### How was this patch tested?
   updated CI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41127: [WIP][SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41127:
URL: https://github.com/apache/spark/pull/41127#discussion_r1192717777


##########
dev/sparktestsupport/modules.py:
##########
@@ -852,6 +852,22 @@ def __hash__(self):
         "pyspark.pandas.tests.connect.test_parity_utils",
         "pyspark.pandas.tests.connect.test_parity_window",
         "pyspark.pandas.tests.connect.indexes.test_parity_base",
+    ],
+    excluded_python_implementations=[
+        "PyPy"  # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and
+        # they aren't available there
+    ],
+)
+
+
+pyspark_pandas_slow_connect = Module(

Review Comment:
   Oh, I thought we add `_slow` as postfix.
   
   So, the rational is to have the same test list with `pyspark_pandas_slow`?
   
   If then, +1.



##########
dev/sparktestsupport/modules.py:
##########
@@ -852,6 +852,22 @@ def __hash__(self):
         "pyspark.pandas.tests.connect.test_parity_utils",
         "pyspark.pandas.tests.connect.test_parity_window",
         "pyspark.pandas.tests.connect.indexes.test_parity_base",
+    ],
+    excluded_python_implementations=[
+        "PyPy"  # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and
+        # they aren't available there
+    ],
+)
+
+
+pyspark_pandas_slow_connect = Module(

Review Comment:
   Please add some comments about the relationship.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41127:
URL: https://github.com/apache/spark/pull/41127#issuecomment-1547153412

   In the [latest run](https://github.com/zhengruifeng/spark/actions/runs/4975618218/jobs/8902974576):
   
   the `pyspark-pandas-connnect` took 1h 25m 21s, and `pyspark-pandas-slow-connnect` took 1h 45m 22s, so I think the split itself should be fine.
   
   but still see :
   
   ```
   Starting test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames (temp output: /__w/spark/spark/python/target/bcf91e64-da92-456c-a65f-2acbb5a57c5e/python3.9__pyspark.pandas.tests.test_ops_on_diff_frames__vhzam369.log)
   Finished test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames (183s)
   Starting test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_groupby (temp output: /__w/spark/spark/python/target/21464771-1bb4-4e2e-8a76-cc5424bc3eee/python3.9__pyspark.pandas.tests.test_ops_on_diff_frames_groupby__3jskgr8q.log)
   Finished test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_groupby (121s)
   Starting test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_slow (temp output: /__w/spark/spark/python/target/4ec97d5c-ff3a-474f-aaa4-a3e08b503667/python3.9__pyspark.pandas.tests.test_ops_on_diff_frames_slow__bfsd6ob8.log)
   Finished test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_slow (167s)
   ```
   
   
   ```
   Starting test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames (temp output: /__w/spark/spark/python/target/7d09fc80-b92b-4872-ae33-62f61aa4d2d9/python3.9__pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames__3d49xqjs.log)
   Finished test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames (1018s)
   Starting test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby (temp output: /__w/spark/spark/python/target/6542f9a4-39fb-4605-9e7b-6d9a3e7b5a63/python3.9__pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby__swoyoc14.log)
   Finished test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby (350s) ... 8 tests were skipped
   Starting test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow (temp output: /__w/spark/spark/python/target/7880eb8c-9299-4e1c-af32-daa48aa10715/python3.9__pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow__y24aawsj.log)
   Finished test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow (732s) ... 7 tests were skipped
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41127: [WIP][SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41127:
URL: https://github.com/apache/spark/pull/41127#discussion_r1192716383


##########
.github/workflows/build_and_test.yml:
##########
@@ -344,6 +344,8 @@ jobs:
             pyspark-connect, pyspark-errors
           - >-
             pyspark-pandas-connect
+          - >-
+            pyspark-pandas-slow-connect

Review Comment:
   `pyspark-pandas-slow-connect` -> `pyspark-pandas-connect-slow`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on pull request #41127: [WIP][SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #41127:
URL: https://github.com/apache/spark/pull/41127#issuecomment-1544366612

   > It seems that `test_ops_on_diff_frames_*` is much slower atop Connect
   
   Interesting.. May I ask if CI jobs run those Connect tests on a single node?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #41127: [SPARK-43442][PYTHON][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41127:
URL: https://github.com/apache/spark/pull/41127#discussion_r1190578407


##########
dev/sparktestsupport/modules.py:
##########
@@ -852,6 +852,22 @@ def __hash__(self):
         "pyspark.pandas.tests.connect.test_parity_utils",
         "pyspark.pandas.tests.connect.test_parity_window",
         "pyspark.pandas.tests.connect.indexes.test_parity_base",
+    ],
+    excluded_python_implementations=[
+        "PyPy"  # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and
+        # they aren't available there
+    ],
+)
+
+
+pyspark_pandas_slow_connect = Module(

Review Comment:
   the new module should be consistent with [pyspark_pandas_slow](https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L706-L733) 
   
   to make it easy to maintain



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41127: [WIP][SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41127:
URL: https://github.com/apache/spark/pull/41127#issuecomment-1547061233

   > > It seems that `test_ops_on_diff_frames_*` is much slower atop Connect
   > 
   > Interesting.. May I ask if CI jobs run those Connect tests on a single node?
   
   yes, all tests are on single node. I am afraid there maybe some performance issue in pandas on spark on connect


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41127:
URL: https://github.com/apache/spark/pull/41127#issuecomment-1548749738

   thank you all for the reviews, @dongjoon-hyun @HyukjinKwon @xinrong-meng @itholic 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #41127: [WIP][SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41127:
URL: https://github.com/apache/spark/pull/41127#discussion_r1193245764


##########
dev/sparktestsupport/modules.py:
##########
@@ -852,6 +852,22 @@ def __hash__(self):
         "pyspark.pandas.tests.connect.test_parity_utils",
         "pyspark.pandas.tests.connect.test_parity_window",
         "pyspark.pandas.tests.connect.indexes.test_parity_base",
+    ],
+    excluded_python_implementations=[
+        "PyPy"  # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and
+        # they aren't available there
+    ],
+)
+
+
+pyspark_pandas_slow_connect = Module(

Review Comment:
   yeah, let me add comments here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41127: [WIP][SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41127:
URL: https://github.com/apache/spark/pull/41127#issuecomment-1543676556

   It seems that `test_ops_on_diff_frames_*` is much slower in Connect
   
   ```
   Starting test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames (temp output: /__w/spark/spark/python/target/e0c7d560-dd29-408d-9f7e-834cd2cc683a/python3.9__pyspark.pandas.tests.test_ops_on_diff_frames__j1jfu3kt.log)
   Finished test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames (177s)
   Starting test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_groupby (temp output: /__w/spark/spark/python/target/bff4fd9b-cf0b-4847-bc19-08114b9b106c/python3.9__pyspark.pandas.tests.test_ops_on_diff_frames_groupby__0fb3fdq5.log)
   Finished test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_groupby (116s)
   Starting test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_slow (temp output: /__w/spark/spark/python/target/89a994de-9f04-4ed8-aa77-16d95b79956e/python3.9__pyspark.pandas.tests.test_ops_on_diff_frames_slow__zjjj9b5a.log)
   Finished test(python3.9): pyspark.pandas.tests.test_ops_on_diff_frames_slow (163s)
   ```
   
   ```
   Starting test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames (temp output: /__w/spark/spark/python/target/80844664-a5e0-4685-804e-8eff938c6681/python3.9__pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames__mdlsorlh.log)
   Finished test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames (1191s)
   Starting test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby (temp output: /__w/spark/spark/python/target/bb5e908e-9185-46ac-826f-a2055f247f43/python3.9__pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby__50hudnbv.log)
   Finished test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby (443s) ... 8 tests were skipped
   Starting test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow (temp output: /__w/spark/spark/python/target/a20e748b-8b2d-493c-b926-ade588f0cb7a/python3.9__pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow__c78c3kwt.log)
   Finished test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_slow (911s) ... 7 tests were skipped
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #41127: [SPARK-43442][PYTHON][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41127:
URL: https://github.com/apache/spark/pull/41127#discussion_r1190578407


##########
dev/sparktestsupport/modules.py:
##########
@@ -852,6 +852,22 @@ def __hash__(self):
         "pyspark.pandas.tests.connect.test_parity_utils",
         "pyspark.pandas.tests.connect.test_parity_window",
         "pyspark.pandas.tests.connect.indexes.test_parity_base",
+    ],
+    excluded_python_implementations=[
+        "PyPy"  # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and
+        # they aren't available there
+    ],
+)
+
+
+pyspark_pandas_slow_connect = Module(

Review Comment:
   the new module should be consistent with https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L706-L733
   
   to make it easy to maintain



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #41127: [SPARK-43442][PYTHON][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41127:
URL: https://github.com/apache/spark/pull/41127#discussion_r1190578407


##########
dev/sparktestsupport/modules.py:
##########
@@ -852,6 +852,22 @@ def __hash__(self):
         "pyspark.pandas.tests.connect.test_parity_utils",
         "pyspark.pandas.tests.connect.test_parity_window",
         "pyspark.pandas.tests.connect.indexes.test_parity_base",
+    ],
+    excluded_python_implementations=[
+        "PyPy"  # Skip these tests under PyPy since they require numpy, pandas, and pyarrow and
+        # they aren't available there
+    ],
+)
+
+
+pyspark_pandas_slow_connect = Module(

Review Comment:
   the new module should be consistent with https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L706-L733



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41127: [WIP][SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41127:
URL: https://github.com/apache/spark/pull/41127#discussion_r1192716383


##########
.github/workflows/build_and_test.yml:
##########
@@ -344,6 +344,8 @@ jobs:
             pyspark-connect, pyspark-errors
           - >-
             pyspark-pandas-connect
+          - >-
+            pyspark-pandas-slow-connect

Review Comment:
   ~`pyspark-pandas-slow-connect` -> `pyspark-pandas-connect-slow`~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41127:
URL: https://github.com/apache/spark/pull/41127#issuecomment-1548298138

   Thank you so much, @zhengruifeng , @HyukjinKwon , @xinrong-meng , @itholic !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #41127: [SPARK-43442][PS][CONNECT][TESTS] Split test module `pyspark_pandas_connect`
URL: https://github.com/apache/spark/pull/41127


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org