Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2023/10/31 11:35:26 UTC

[PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

HyukjinKwon opened a new pull request, #43600:
URL: https://github.com/apache/spark/pull/43600

   ### What changes were proposed in this pull request?
   
   This PR improves the `spark.python.worker.faulthandler.enabled` feature by catching `IOException` instead of only the narrower `EOFException`.
   
   ### Why are the changes needed?
   
   Exceptions such as `java.net.SocketException: Connection reset` can occur when the worker dies unexpectedly. We should catch all I/O exceptions there.
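
The pattern described above can be sketched in Python. This is only an analogy under stated assumptions, not the Scala code changed by this PR: `WorkerCrashed`, `read_from_worker`, and `fault_log_path` are hypothetical names, and the idea that a crash report is available in a log file at a known path is an assumption made for illustration. In Java, `EOFException` and `SocketException` are both subclasses of `IOException`; in Python, `EOFError` is not a subclass of `OSError`, so the sketch lists both.

```python
import os


class WorkerCrashed(Exception):
    """Hypothetical stand-in for the exception surfaced to the user."""


def read_from_worker(read_fn, fault_log_path):
    """Call read_fn(); on any I/O failure, attach the crash report if one exists.

    read_fn and fault_log_path are illustrative parameters, not Spark APIs.
    """
    try:
        return read_fn()
    except (EOFError, OSError) as exc:
        # OSError includes ConnectionResetError, the closest Python analogue
        # of java.net.SocketException: Connection reset.
        if os.path.exists(fault_log_path):
            with open(fault_log_path) as f:
                report = f.read()
            os.remove(fault_log_path)
            raise WorkerCrashed(
                "Python worker exited unexpectedly (crashed): " + report
            ) from exc
        # No crash report was written, so propagate the original error.
        raise
```

Catching only `EOFError` here would let a `ConnectionResetError` bypass the crash-report lookup, which mirrors the gap this PR closes on the JVM side.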
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but only in special cases: when the worker dies unexpectedly during its initialization, the error now reports the faulthandler output (see **After** below) instead of a bare I/O exception.
   
   ### How was this patch tested?
   
   I tested this with Spark Connect:
   
   ```bash
   ./sbin/stop-connect-server.sh
   ./sbin/start-connect-server.sh --conf spark.python.daemon.module=malformed_daemon --conf spark.python.worker.faulthandler.enabled=true --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
   ```
   ```bash
   ./bin/pyspark --remote "sc://localhost:15002"
   ```
   
   ```python
   from pyspark.sql.functions import udf
   spark.addArtifact("malformed_daemon.py", pyfile=True)
   spark.range(1).select(udf(lambda x: x)("id")).collect()
   ```
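
The contents of `malformed_daemon.py` are not shown in this thread. As a rough, standalone illustration of the kind of crash involved, the sketch below runs a child interpreter that enables Python's `faulthandler` and then reads from address 0 via `ctypes.string_at`. The child program, `CHILD`, and `run_crashing_child` are assumptions for illustration, not the file used in the test above, and the exact crash message depends on the platform.

```python
import subprocess
import sys

# Child program: enable faulthandler, then read from address 0 to force a crash.
# This is a guess at the general shape of such a script, not malformed_daemon.py.
CHILD = """
import ctypes
import faulthandler

faulthandler.enable()

def raise_segfault():
    ctypes.string_at(0)

raise_segfault()
"""


def run_crashing_child():
    """Run CHILD in a separate interpreter and return (return code, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_crashing_child()
    print("return code:", code)
    print(err)
```

With `faulthandler` enabled, the child's stderr carries a Python-level traceback naming `raise_segfault`, which is the kind of report that appears in the **After** output below.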
   
   **Before**
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1710, in collect
       table, schema = self._to_table()
       ...
     File "/.../spark/python/pyspark/sql/connect/client/core.py", line 1575, in _handle_rpc_error
       raise convert_exception(
   pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8) (192.168.123.102 executor driver): java.net.SocketException: Connection reset
         ...
   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
   	at java.base/java.lang.Thread.run(Thread.java:833)
   
   Driver stacktrace:
   
   JVM stacktrace:
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8) (192.168.123.102 executor driver): java.net.SocketException: Connection reset
   	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
       ...
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
   	at java.lang.Thread.run(Thread.java:833)
   ```
   
   **After**
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1710, in collect
       table, schema = self._to_table()
       ...
     File "/.../spark/python/pyspark/sql/connect/client/core.py", line 1575, in _handle_rpc_error
       raise convert_exception(
   pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4) (192.168.123.102 executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
   
   Current thread 0x00007ff85d338700 (most recent call first):
     File "/.../miniconda3/envs/python3.9/lib/python3.9/ctypes/__init__.py", line 525 in string_at
     File "/private/var/folders/0c/q8y15ybd3tn7sr2_jmbmftr80000gp/T/spark-397ac42b-c05b-4f50-a6b8-ede30254edc9/userFiles-fd70c41e-46b9-44ed-b781-f8dea10bcb4a/5ce3da24-912a-4207-af82-5dfc8a845714/malformed_daemon.py", line 8 in raise_segfault
     File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 1450 in main
     ...
     File "/.../miniconda3/envs/python3.9/lib/python3.9/runpy.py", line 197 in _run_module_as_main
   
   	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:550)
        ...
   	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
   	at org.apache.spark.sql.execution.python.BasePythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:92)
   	... 30 more
   
   Driver stacktrace:
   
   JVM stacktrace:
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4) (192.168.123.102 executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
   
   Current thread 0x00007ff85d338700 (most recent call first):
     File "/.../miniconda3/envs/python3.9/lib/python3.9/ctypes/__init__.py", line 525 in string_at
   ...
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1788126056

   Merged to master.




Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1787488986

   Although the failure looks unrelated to this change, could you re-trigger the failed PySpark test, @HyukjinKwon?
   
   > Test: https://github.com/HyukjinKwon/spark/actions/runs/6705864326/job/18221201806
   
   




Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1788287681

   Late LGTM.




Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1787042094

   cc @ueshin 




Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1787041850

   Test: https://github.com/HyukjinKwon/spark/actions/runs/6705864326/job/18221201806




Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1788125916

   Retriggered and passed at https://github.com/HyukjinKwon/spark/actions/runs/6705864326/job/18242116209




Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43600: [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler
URL: https://github.com/apache/spark/pull/43600

