Posted to reviews@spark.apache.org by "gerashegalov (via GitHub)" <gi...@apache.org> on 2023/03/10 20:39:39 UTC

[GitHub] [spark] gerashegalov opened a new pull request, #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

gerashegalov opened a new pull request, #40372:
URL: https://github.com/apache/spark/pull/40372

   Ignore SQLConf initialization exceptions during Python exception creation.
   
   Otherwise there are no diagnostics for the issue in the following scenario:
   
   1. Download a standard "Hadoop Free" build
   2. Start the PySpark REPL with Hive support:
   ```bash
   SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) \
     ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive
   ```
   3. Execute any simple DataFrame operation:
   ```Python
   >>> spark.range(100).show()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range
       jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions))
     File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
     File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco
       raise converted from None
   pyspark.sql.utils.IllegalArgumentException: <exception str() failed>
   ```
   4. In fact, merely accessing `spark.conf` already exhibits the issue:
   ```Python
   >>> spark.conf
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 347, in conf
       self._conf = RuntimeConfig(self._jsparkSession.conf())
     File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
     File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco
       raise converted from None
   pyspark.sql.utils.IllegalArgumentException: <exception str() failed>
   ```
   
   There are probably two issues here:
   1) Hive support should be gracefully disabled if the dependency is not on the classpath, as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
   2) at the very least, the user should be able to see the exception, understand the issue, and take action
   
   ### What changes were proposed in this pull request?
   
   Ignore exceptions during `CapturedException` creation
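
   A minimal sketch of the proposed logic in `CapturedException.__str__`, mirroring the diff quoted in the review comments below (`jvm` is the Py4J view obtained from `SparkContext._jvm`):
   ```Python
   # SPARK-42752: default to True so that exceptions raised while the
   # session is still initializing remain printable.
   debug_enabled = True
   try:
       sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
       debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
   except BaseException:
       # SQLConf itself may be broken during initialization; fall back
       # to the default rather than raising from __str__.
       pass
   ```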
   
   ### Why are the changes needed?
   To make the cause visible to the user
   
   ```Python
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/user/gits/apache/spark/python/pyspark/sql/session.py", line 679, in conf
       self._conf = RuntimeConfig(self._jsparkSession.conf())
     File "/home/user/gits/apache/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
     File "/home/user/gits/apache/spark/python/pyspark/errors/exceptions/captured.py", line 166, in deco
       raise converted from None
   pyspark.errors.exceptions.captured.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
   
   JVM stacktrace:
   java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
           at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1237)
           at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:162)
           at scala.Option.getOrElse(Option.scala:189)
           at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:160)
           at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:157)
           at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:185)
           at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:185)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
           at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
           at py4j.Gateway.invoke(Gateway.java:282)
           at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
           at py4j.commands.CallCommand.execute(CallCommand.java:79)
           at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
           at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
           at java.lang.Thread.run(Thread.java:750)
   Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.HiveSessionStateBuilder
           at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
           at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
           at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
           at java.lang.Class.forName0(Native Method)
           at java.lang.Class.forName(Class.java:348)
           at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
           at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1232)
           ... 18 more
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   The only semantic change is that the conf `spark.sql.pyspark.jvmStacktrace.enabled` is ignored if the SQLConf is broken. 
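
   For context, when the SQLConf is healthy the flag is still honored and can be toggled as usual, e.g. (assuming an active `spark` session):
   ```Python
   # Only consulted when SQLConf can be read; otherwise the JVM
   # stacktrace is shown by default after this change.
   spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", "true")
   ```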
   
   ### How was this patch tested?
   Manual testing using the repro steps above
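
   A hypothetical automated check (not part of this PR; names are illustrative) would assert that rendering a converted exception never fails:
   ```Python
   # Illustrative only: str() on a converted exception must not itself
   # raise, even when SQLConf is unavailable during initialization.
   def assert_printable(exc: BaseException) -> None:
       rendered = str(exc)  # would raise before this fix
       assert isinstance(rendered, str)
   ```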




[GitHub] [spark] srowen closed pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen closed pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization
URL: https://github.com/apache/spark/pull/40372




[GitHub] [spark] itholic commented on a diff in pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40372:
URL: https://github.com/apache/spark/pull/40372#discussion_r1133543656


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -65,8 +65,15 @@ def __str__(self) -> str:
         assert SparkContext._jvm is not None
 
         jvm = SparkContext._jvm
-        sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
-        debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+
+        # SPARK-42752: default to True to see issues with initialization
+        debug_enabled = True
+        try:
+            sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
+            debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+        except BaseException:

Review Comment:
   Maybe we can catch the exception more specifically, e.g. `pyspark.sql.utils.IllegalArgumentException`, instead of catching `BaseException`?
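
   For illustration, a narrower variant along those lines might look like the sketch below (note that at this point the JVM-side failure typically surfaces as a `Py4JJavaError` before PySpark's conversion; which type to catch is exactly the open question here):
   ```Python
   from py4j.protocol import Py4JJavaError

   debug_enabled = True
   try:
       sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
       debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
   except Py4JJavaError:
       # Swallow only JVM-side failures; let Python-side bugs surface.
       pass
   ```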





[GitHub] [spark] itholic commented on a diff in pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #40372:
URL: https://github.com/apache/spark/pull/40372#discussion_r1136490675


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -65,8 +65,15 @@ def __str__(self) -> str:
         assert SparkContext._jvm is not None
 
         jvm = SparkContext._jvm
-        sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
-        debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+
+        # SPARK-42752: default to True to see issues with initialization
+        debug_enabled = True
+        try:
+            sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
+            debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+        except BaseException:

Review Comment:
   Sounds good. Thanks for working on this!





[GitHub] [spark] zhengruifeng commented on pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40372:
URL: https://github.com/apache/spark/pull/40372#issuecomment-1465459996

   cc @itholic @HyukjinKwon 




[GitHub] [spark] srowen commented on pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #40372:
URL: https://github.com/apache/spark/pull/40372#issuecomment-1468108487

   Merged to master




[GitHub] [spark] gerashegalov commented on a diff in pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

Posted by "gerashegalov (via GitHub)" <gi...@apache.org>.
gerashegalov commented on code in PR #40372:
URL: https://github.com/apache/spark/pull/40372#discussion_r1134801751


##########
python/pyspark/errors/exceptions/captured.py:
##########
@@ -65,8 +65,15 @@ def __str__(self) -> str:
         assert SparkContext._jvm is not None
 
         jvm = SparkContext._jvm
-        sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
-        debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+
+        # SPARK-42752: default to True to see issues with initialization
+        debug_enabled = True
+        try:
+            sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
+            debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+        except BaseException:

Review Comment:
   I advocate keeping the likelihood of an unhelpful, unprintable exception during initialization to a minimum. I would not want to revisit this issue for other runtime exceptions.
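
   The broad catch guards against exactly this failure mode: if `__str__` raises while Python renders a traceback, CPython replaces the message with `<exception str() failed>`. A minimal standalone demonstration (plain Python, no Spark involved):
   ```Python
   import traceback

   # An exception whose __str__ raises is rendered by CPython's
   # traceback machinery as "<exception str() failed>".
   class Unprintable(Exception):
       def __str__(self) -> str:
           raise RuntimeError("conf lookup failed")

   try:
       raise Unprintable()
   except Unprintable:
       traceback.print_exc()
       # last line printed: "...Unprintable: <exception str() failed>"
   ```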





[GitHub] [spark] gerashegalov commented on pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

Posted by "gerashegalov (via GitHub)" <gi...@apache.org>.
gerashegalov commented on PR #40372:
URL: https://github.com/apache/spark/pull/40372#issuecomment-1474472321

   Thanks for reviews and merging.

