Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2023/10/31 11:35:26 UTC
[PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
HyukjinKwon opened a new pull request, #43600:
URL: https://github.com/apache/spark/pull/43600
### What changes were proposed in this pull request?
This PR improves the `spark.python.worker.faulthandler.enabled` feature by catching `IOException` instead of only the narrower `EOFException`.
### Why are the changes needed?
Exceptions such as `java.net.SocketException: Connection reset` can occur when the worker dies unexpectedly. It is better to catch all IO exceptions there.
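The effect of widening the catch can be illustrated with a small Python model. This is only a sketch, not Spark's code: the classes below are stand-ins that mirror the Java hierarchy (`EOFException` and `SocketException` both extend `IOException`), and `read_result`, `WorkerCrashed`, and `failing_read` are hypothetical names introduced for illustration.

```python
# Stand-ins for the Java exception hierarchy (hypothetical, for illustration).
class IOException(Exception):
    """Models java.io.IOException."""


class EOFException(IOException):
    """Models java.io.EOFException."""


class SocketException(IOException):
    """Models java.net.SocketException."""


class WorkerCrashed(Exception):
    """Models the wrapped error that carries the faulthandler report."""


def read_result(read_fn, fault_report, catch):
    """Call read_fn; if it fails with `catch` and a report exists, wrap it."""
    try:
        return read_fn()
    except catch as e:
        if fault_report:
            raise WorkerCrashed(
                "Python worker exited unexpectedly (crashed): " + fault_report
            ) from e
        raise  # no report available, so surface the original error


def failing_read():
    # Simulates the JVM side reading from a worker that has already died.
    raise SocketException("Connection reset")


report = "Fatal Python error: Segmentation fault"

# Narrow catch: the SocketException escapes and the report is lost.
try:
    read_result(failing_read, report, EOFException)
except SocketException as e:
    print("narrow catch:", e)

# Broad catch: the same failure is reported together with the crash report.
try:
    read_result(failing_read, report, IOException)
except WorkerCrashed as e:
    print("broad catch:", e)
```

With the narrow catch the caller only sees `Connection reset`; with the broad catch the crash report is attached, which corresponds to the difference between the **Before** and **After** output in this description.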
### Does this PR introduce _any_ user-facing change?
Yes, but only in special cases: when the worker dies unexpectedly during its initialization, the error now includes the faulthandler output instead of the raw IO exception.
### How was this patch tested?
I tested this with Spark Connect:
```bash
./sbin/stop-connect-server.sh
./sbin/start-connect-server.sh --conf spark.python.daemon.module=malformed_daemon --conf spark.python.worker.faulthandler.enabled=true --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
```
```bash
./bin/pyspark --remote "sc://localhost:15002"
```
```python
from pyspark.sql.functions import udf
spark.addArtifact("malformed_daemon.py", pyfile=True)
spark.range(1).select(udf(lambda x: x)("id")).collect()
```
**Before**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1710, in collect
table, schema = self._to_table()
...
File "/.../spark/python/pyspark/sql/connect/client/core.py", line 1575, in _handle_rpc_error
raise convert_exception(
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8) (192.168.123.102 executor driver): java.net.SocketException: Connection reset
at
...
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Driver stacktrace:
JVM stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 0.0 failed 1 times, most recent failure: Lost task 8.0 in stage 0.0 (TID 8) (192.168.123.102 executor driver): java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
at
...
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.lang.Thread.run(Thread.java:833)
```
**After**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1710, in collect
table, schema = self._to_table()
...
"/.../spark/python/pyspark/sql/connect/client/core.py", line 1575, in _handle_rpc_error
raise convert_exception(
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4) (192.168.123.102 executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
Current thread 0x00007ff85d338700 (most recent call first):
File "/.../miniconda3/envs/python3.9/lib/python3.9/ctypes/__init__.py", line 525 in string_at
File "/private/var/folders/0c/q8y15ybd3tn7sr2_jmbmftr80000gp/T/spark-397ac42b-c05b-4f50-a6b8-ede30254edc9/userFiles-fd70c41e-46b9-44ed-b781-f8dea10bcb4a/5ce3da24-912a-4207-af82-5dfc8a845714/malformed_daemon.py", line 8 in raise_segfault
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 1450 in main
...
"/.../miniconda3/envs/python3.9/lib/python3.9/runpy.py", line 197 in _run_module_as_main
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:550)
at
...
java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
at org.apache.spark.sql.execution.python.BasePythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:92)
... 30 more
Driver stacktrace:
JVM stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4) (192.168.123.102 executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
Current thread 0x00007ff85d338700 (most recent call first):
File "/.../miniconda3/envs/python3.9/lib/python3.9/ctypes/__init__.py", line 525 in string_at
...
```
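The `Fatal Python error: Segmentation fault` report in the **After** output has the format produced by Python's standard `faulthandler` module. The contents of `malformed_daemon.py` are not included in this message, so the following is a standalone sketch under stated assumptions rather than the actual test file: it crashes a child interpreter on purpose (the `ctypes.string_at(0)` call is an assumption based on the `string_at` frame in the traceback) and reads back the report that `faulthandler` wrote. The helper name `crash_and_collect_report` is hypothetical, and the sketch assumes a POSIX-like platform.

```python
import os
import subprocess
import sys
import tempfile

# Child program: enable faulthandler with a log file, then crash on purpose.
CHILD_SOURCE = """
import ctypes
import faulthandler
import sys

log = open(sys.argv[1], "w")   # kept open so faulthandler can write to it
faulthandler.enable(file=log)


def raise_segfault():
    ctypes.string_at(0)  # read from address 0 to trigger a segmentation fault


raise_segfault()
"""


def crash_and_collect_report():
    """Run the crashing child and return (exit code, faulthandler report)."""
    fd, log_path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    try:
        proc = subprocess.run([sys.executable, "-c", CHILD_SOURCE, log_path])
        with open(log_path) as f:
            return proc.returncode, f.read()
    finally:
        os.remove(log_path)
```

Calling `crash_and_collect_report()` returns a non-zero exit code for the child along with the text that `faulthandler` wrote before the process died, which is the kind of report the JVM side can attach to the error once the IO failure is caught.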
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1788126056
Merged to master.
Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1787488986
Although the failure looks unrelated, could you re-trigger the failed PySpark test, @HyukjinKwon?
> Test: https://github.com/HyukjinKwon/spark/actions/runs/6705864326/job/18221201806
Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1788287681
Late LGTM.
Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1787042094
cc @ueshin
Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1787041850
Test: https://github.com/HyukjinKwon/spark/actions/runs/6705864326/job/18221201806
Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43600:
URL: https://github.com/apache/spark/pull/43600#issuecomment-1788125916
Retriggered and passed at https://github.com/HyukjinKwon/spark/actions/runs/6705864326/job/18242116209
Re: [PR] [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler [spark]
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43600: [SPARK-45739][PYTHON] Catch IOException instead of EOFException alone for faulthandler
URL: https://github.com/apache/spark/pull/43600