You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "hvanhovell (via GitHub)" <gi...@apache.org> on 2023/07/31 18:28:32 UTC

[GitHub] [spark] hvanhovell opened a new pull request, #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

hvanhovell opened a new pull request, #42245:
URL: https://github.com/apache/spark/pull/42245

   ### What changes were proposed in this pull request?
   This PR adds a better error message when a JVM UDF cannot be deserialized.
   
   ### Why are the changes needed?
   In some cases a UDF cannot be deserialized. The happens when a lambda references itself (typically through the capturing class). Java cannot deserialize such an object graph because SerializedLambda's are serialization proxies which need the full graph to be deserialized before they can be transformed into the actual lambda. This is not possible if there is such a cycle. This PR adds a more readable and understandable error when this happens, the original java one is a `ClassCastException` which does not explain or give any hints of what is actually going on.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. It will throw an error on the client when a UDF is not deserializable. The error is better and more actionable then what we got before.
   
   ### How was this patch tested?
   Added tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #42245:
URL: https://github.com/apache/spark/pull/42245#issuecomment-1660901012

   Merging. Error is unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on a diff in pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #42245:
URL: https://github.com/apache/spark/pull/42245#discussion_r1280172761


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
##########
@@ -144,6 +146,25 @@ case class ScalarUserDefinedFunction private[sql] (
 }
 
 object ScalarUserDefinedFunction {
+  private val LAMBDA_DESERIALIZATION_ERR_MSG: String =
+    "cannot assign instance of java.lang.invoke.SerializedLambda to field"
+
+  private def checkDeserializable(bytes: Array[Byte]): Unit = {
+    try {
+      SparkSerDeUtils.deserialize(bytes, SparkClassUtils.getContextOrSparkClassLoader)
+    } catch {
+      case e: ClassCastException if e.getMessage.contains(LAMBDA_DESERIALIZATION_ERR_MSG) =>
+        throw new SparkException(
+          "UDF cannot be executed on a Spark cluster: it cannot be deserialized. " +
+            "This is very likely to be caused by the lambda function (the UDF) having a "

Review Comment:
   It seems better to use `"""..."""`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "amaliujia (via GitHub)" <gi...@apache.org>.
amaliujia commented on code in PR #42245:
URL: https://github.com/apache/spark/pull/42245#discussion_r1279822239


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
##########
@@ -144,6 +146,25 @@ case class ScalarUserDefinedFunction private[sql] (
 }
 
 object ScalarUserDefinedFunction {
+  private val LAMBDA_DESERIALIZATION_ERR_MSG: String =
+    "cannot assign instance of java.lang.invoke.SerializedLambda to field"
+
+  private def checkDeserializable(bytes: Array[Byte]): Unit = {
+    try {
+      SparkSerDeUtils.deserialize(bytes, SparkClassUtils.getContextOrSparkClassLoader)
+    } catch {
+      case e: ClassCastException if e.getMessage.contains(LAMBDA_DESERIALIZATION_ERR_MSG) =>
+        throw new SparkException(
+          "UDF cannot be executed on a Spark cluster, it cannot be deserialized." +
+            "This is very likely to be caused by the UDF having a self-reference, " +
+            "this is not supported by java serialization.")

Review Comment:
   ```suggestion
           throw new SparkException(
             "UDF cannot be executed on a Spark cluster: it cannot be deserialized." +
               "This is very likely to be caused by the UDF having a self-reference. " +
               "This is not supported by java serialization.")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42245:
URL: https://github.com/apache/spark/pull/42245#discussion_r1280164462


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
##########
@@ -144,6 +146,25 @@ case class ScalarUserDefinedFunction private[sql] (
 }
 
 object ScalarUserDefinedFunction {
+  private val LAMBDA_DESERIALIZATION_ERR_MSG: String =
+    "cannot assign instance of java.lang.invoke.SerializedLambda to field"
+
+  private def checkDeserializable(bytes: Array[Byte]): Unit = {
+    try {
+      SparkSerDeUtils.deserialize(bytes, SparkClassUtils.getContextOrSparkClassLoader)
+    } catch {
+      case e: ClassCastException if e.getMessage.contains(LAMBDA_DESERIALIZATION_ERR_MSG) =>
+        throw new SparkException(
+          "UDF cannot be executed on a Spark cluster: it cannot be deserialized. " +
+            "This is very likely to be caused by the lambda function (the UDF) having a "

Review Comment:
   ```suggestion
               "This is very likely to be caused by the lambda function (the UDF) having a " +
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell closed pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell closed pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.
URL: https://github.com/apache/spark/pull/42245


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on a diff in pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #42245:
URL: https://github.com/apache/spark/pull/42245#discussion_r1280566782


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
##########
@@ -144,6 +146,25 @@ case class ScalarUserDefinedFunction private[sql] (
 }
 
 object ScalarUserDefinedFunction {
+  private val LAMBDA_DESERIALIZATION_ERR_MSG: String =
+    "cannot assign instance of java.lang.invoke.SerializedLambda to field"
+
+  private def checkDeserializable(bytes: Array[Byte]): Unit = {
+    try {
+      SparkSerDeUtils.deserialize(bytes, SparkClassUtils.getContextOrSparkClassLoader)
+    } catch {
+      case e: ClassCastException if e.getMessage.contains(LAMBDA_DESERIALIZATION_ERR_MSG) =>
+        throw new SparkException(
+          "UDF cannot be executed on a Spark cluster: it cannot be deserialized. " +
+            "This is very likely to be caused by the lambda function (the UDF) having a "

Review Comment:
   We can stripe it if we do not need multi-line string. Just for the code style.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "amaliujia (via GitHub)" <gi...@apache.org>.
amaliujia commented on code in PR #42245:
URL: https://github.com/apache/spark/pull/42245#discussion_r1279822549


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
##########
@@ -144,6 +146,25 @@ case class ScalarUserDefinedFunction private[sql] (
 }
 
 object ScalarUserDefinedFunction {
+  private val LAMBDA_DESERIALIZATION_ERR_MSG: String =
+    "cannot assign instance of java.lang.invoke.SerializedLambda to field"
+
+  private def checkDeserializable(bytes: Array[Byte]): Unit = {
+    try {
+      SparkSerDeUtils.deserialize(bytes, SparkClassUtils.getContextOrSparkClassLoader)
+    } catch {
+      case e: ClassCastException if e.getMessage.contains(LAMBDA_DESERIALIZATION_ERR_MSG) =>
+        throw new SparkException(
+          "UDF cannot be executed on a Spark cluster, it cannot be deserialized." +
+            "This is very likely to be caused by the UDF having a self-reference, " +
+            "this is not supported by java serialization.")
+      case NonFatal(e) =>
+        throw new SparkException(
+          "UDF cannot be executed on a Spark cluster, it cannot be deserialized.",
+          e)

Review Comment:
   ```suggestion
           throw new SparkException(
             "UDF cannot be executed on a Spark cluster: it cannot be deserialized.",
             e)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on a diff in pull request #42245: [SPARK-29497][CONNECT] Throw error when UDF is not deserializable.

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #42245:
URL: https://github.com/apache/spark/pull/42245#discussion_r1280522452


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
##########
@@ -144,6 +146,25 @@ case class ScalarUserDefinedFunction private[sql] (
 }
 
 object ScalarUserDefinedFunction {
+  private val LAMBDA_DESERIALIZATION_ERR_MSG: String =
+    "cannot assign instance of java.lang.invoke.SerializedLambda to field"
+
+  private def checkDeserializable(bytes: Array[Byte]): Unit = {
+    try {
+      SparkSerDeUtils.deserialize(bytes, SparkClassUtils.getContextOrSparkClassLoader)
+    } catch {
+      case e: ClassCastException if e.getMessage.contains(LAMBDA_DESERIALIZATION_ERR_MSG) =>
+        throw new SparkException(
+          "UDF cannot be executed on a Spark cluster: it cannot be deserialized. " +
+            "This is very likely to be caused by the lambda function (the UDF) having a "

Review Comment:
   @beliefer that would create a multi-line string right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org