Posted to commits@spark.apache.org by sr...@apache.org on 2023/03/14 13:30:51 UTC
[spark] branch master updated: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new b2a7f14cbd8 [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization
b2a7f14cbd8 is described below
commit b2a7f14cbd8fd3b1a51d7b53fc7c23fb71e9f370
Author: Gera Shegalov <ge...@apache.org>
AuthorDate: Tue Mar 14 08:30:15 2023 -0500
[SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization
Ignore SQLConf initialization exceptions during Python exception creation.
Otherwise there are no diagnostics for the issue in the following scenario:
1. Download a standard "Hadoop Free" build
2. Start the PySpark REPL with Hive support
```bash
SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) \
~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive
```
3. Execute any simple dataframe operation
```Python
>>> spark.range(100).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range
jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions))
File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: <exception str() failed>
```
4. In fact, just accessing `spark.conf` already exhibits the issue
```Python
>>> spark.conf
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 347, in conf
self._conf = RuntimeConfig(self._jsparkSession.conf())
File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: <exception str() failed>
```
There are probably two issues here:
1) Hive support should be gracefully disabled if the dependency is not on the classpath, as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
2) but at the very least the user should be able to see the exception, understand the issue, and take action
### What changes were proposed in this pull request?
Ignore exceptions during `CapturedException` creation
### Why are the changes needed?
To make the cause visible to the user
```Python
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/gits/apache/spark/python/pyspark/sql/session.py", line 679, in conf
self._conf = RuntimeConfig(self._jsparkSession.conf())
File "/home/user/gits/apache/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/home/user/gits/apache/spark/python/pyspark/errors/exceptions/captured.py", line 166, in deco
raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
JVM stacktrace:
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1237)
at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:162)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:160)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:157)
at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:185)
at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:185)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.HiveSessionStateBuilder
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1232)
... 18 more
```
### Does this PR introduce _any_ user-facing change?
The only semantic change is that the conf `spark.sql.pyspark.jvmStacktrace.enabled` is ignored if the SQLConf is broken.
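The fallback behavior can be sketched as a generic defensive-read pattern (plain Python; `read_flag` and `broken_conf` are illustrative names, not Spark APIs): prefer the configured value, but fall back to a diagnostics-friendly default when the config layer itself is broken.

```python
def read_flag(get_conf, default=True):
    """Return the configured value, or `default` when reading the
    config itself fails, mirroring the patch's fallback to showing
    the JVM stacktrace."""
    try:
        return get_conf()
    except Exception:
        return default

def broken_conf():
    # Simulates SQLConf.get() failing during session initialization.
    raise RuntimeError("SQLConf not initialized")

assert read_flag(lambda: False) is False  # healthy conf is respected
assert read_flag(broken_conf) is True     # broken conf -> stacktrace shown
```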
### How was this patch tested?
Manual testing using the repro steps above
Closes #40372 from gerashegalov/SPARK-42752.
Authored-by: Gera Shegalov <ge...@apache.org>
Signed-off-by: Sean Owen <sr...@gmail.com>
---
python/pyspark/errors/exceptions/captured.py | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/errors/exceptions/captured.py b/python/pyspark/errors/exceptions/captured.py
index 1764ed7d02c..6313665b3fe 100644
--- a/python/pyspark/errors/exceptions/captured.py
+++ b/python/pyspark/errors/exceptions/captured.py
@@ -65,8 +65,15 @@ class CapturedException(PySparkException):
assert SparkContext._jvm is not None
jvm = SparkContext._jvm
- sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
- debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+
+ # SPARK-42752: default to True to see issues with initialization
+ debug_enabled = True
+ try:
+ sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
+ debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+ except BaseException:
+ pass
+
desc = self.desc
if debug_enabled:
desc = desc + "\n\nJVM stacktrace:\n%s" % self.stackTrace