You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "juliuszsompolski (via GitHub)" <gi...@apache.org> on 2023/09/05 16:42:08 UTC

[GitHub] [spark] juliuszsompolski opened a new pull request, #42818: [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error

juliuszsompolski opened a new pull request, #42818:
URL: https://github.com/apache/spark/pull/42818

   ### What changes were proposed in this pull request?
   
   Make INVALID_CURSOR.DISCONNECTED a retriable error.
   
   ### Why are the changes needed?
   
   This error can happen if two RPCs are racing to reattach to the query, and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That was fixed, but to be more robust, we can make this error retryable.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Tests will be added in https://github.com/apache/spark/pull/42560
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #42818: [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42818:
URL: https://github.com/apache/spark/pull/42818#discussion_r1317007141


##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -585,11 +585,39 @@ class SparkConnectClient(object):
 
     @classmethod
     def retry_exception(cls, e: Exception) -> bool:
-        if isinstance(e, grpc.RpcError):
-            return e.code() == grpc.StatusCode.UNAVAILABLE
-        else:
+        """
+        Helper function that is used to identify if an exception thrown by the server
+        can be retried or not.
+
+        Parameters
+        ----------
+        e : Exception
+            The GRPC error as received from the server. Typed as Exception, because other exception
+            thrown during client processing can be passed here as well.
+
+        Returns
+        -------
+        True if the exception can be retried, False otherwise.
+
+        """
+        if not isinstance(e, grpc.RpcError):
             return False
 
+        if e.code() in [
+            grpc.StatusCode.INTERNAL
+        ]:
+            # This error happens if another RPC preempts this RPC, retry.
+            msg_cursor_disconnected = "INVALID_CURSOR.DISCONNECTED"
+
+            msg = str(e)
+            if any(map(lambda x: x in msg, [msg_cursor_disconnected])):
+                return True

Review Comment:
   We can do this in a follow up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] juliuszsompolski commented on pull request #42818: [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on PR #42818:
URL: https://github.com/apache/spark/pull/42818#issuecomment-1708624411

   https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544325699
   ```
   [info] *** 1 TEST FAILED ***
   [error] Failed tests:
   [error] 	org.apache.spark.deploy.k8s.integrationtest.VolcanoSuite
   ```
   flake
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #42818: [SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42818:
URL: https://github.com/apache/spark/pull/42818#issuecomment-1709345232

   Merged to master and branch-3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on a diff in pull request #42818: [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #42818:
URL: https://github.com/apache/spark/pull/42818#discussion_r1317537852


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala:
##########
@@ -217,7 +217,23 @@ private[sql] object GrpcRetryHandler extends Logging {
    */
   private[client] def retryException(e: Throwable): Boolean = {
     e match {
-      case e: StatusRuntimeException => e.getStatus.getCode == Status.Code.UNAVAILABLE
+      case e: StatusRuntimeException =>
+        val statusCode: Status.Code = e.getStatus.getCode
+
+        if (List(Status.Code.INTERNAL)

Review Comment:
   I am surely missing something here, but isn't this equivalent to `statusCode: == Status.Code.INTERNAL`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] juliuszsompolski commented on pull request #42818: [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on PR #42818:
URL: https://github.com/apache/spark/pull/42818#issuecomment-1708631041

   https://github.com/juliuszsompolski/apache-spark/actions/runs/6097172856/job/16544375244
   ```
     test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... malloc(): unsorted double linked list corrupted
   ERROR (612.473s)
     test_self_join (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... ERROR (611.358s)
   ERROR:root:Exception while sending command.
   Traceback (most recent call last):
     File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
       raise Py4JNetworkError("Answer from Java side is empty")
   py4j.protocol.Py4JNetworkError: Answer from Java side is empty
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
       response = connection.send_command(command)
     File "/__w/apache-spark/apache-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
       raise Py4JNetworkError(
   py4j.protocol.Py4JNetworkError: Error while sending or receiving
   /__w/apache-spark/apache-spark/python/pyspark/context.py:657: RuntimeWarning: Unable to cleanly shutdown Spark JVM process. It is possible that the process has crashed, been killed or may also be in a zombie state.
     warnings.warn(
   /__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/reattach.py:219: UserWarning: ReleaseExecute failed with exception: Cannot invoke RPC on closed channel!.
     warnings.warn(f"ReleaseExecute failed with exception: {e}.")
   
   ======================================================================
   ERROR [612.473s]: test_other_than_dataframe_iter (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/tests/connect/test_parity_pandas_map.py", line 26, in test_other_than_dataframe_iter
       self.check_other_than_dataframe_iter()
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 163, in check_other_than_dataframe_iter
       (self.spark.range(10, numPartitions=3).mapInPandas(bad_iter_elem, "a int").count())
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/dataframe.py", line 252, in count
       pdd = self.agg(_invoke_function("count", lit(1))).toPandas()
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/dataframe.py", line 1703, in toPandas
       return self._session.client.to_pandas(query)
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 873, in to_pandas
       table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1282, in _execute_and_fetch
       for response in self._execute_and_fetch_as_iterator(req):
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1263, in _execute_and_fetch_as_iterator
       self._handle_error(error)
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1502, in _handle_error
       self._handle_rpc_error(error)
     File "/__w/apache-spark/apache-spark/python/pyspark/sql/connect/client/core.py", line 1542, in _handle_rpc_error
       raise SparkConnectGrpcException(str(rpc_error)) from None
   pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
   	status = StatusCode.UNAVAILABLE
   	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:38175: Failed to connect to remote host: Connection refused"
   	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:38175: Failed to connect to remote host: Connection refused {created_time:"2023-09-06T13:29:14.012172836+00:00", grpc_status:14}"
   >
   ```
   
   Unclean shutdown of the client after another failure, leaves the ReleaseExecute in another thread hanging... failure not related to this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] juliuszsompolski commented on a diff in pull request #42818: [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on code in PR #42818:
URL: https://github.com/apache/spark/pull/42818#discussion_r1317561918


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala:
##########
@@ -217,7 +217,23 @@ private[sql] object GrpcRetryHandler extends Logging {
    */
   private[client] def retryException(e: Throwable): Boolean = {
     e match {
-      case e: StatusRuntimeException => e.getStatus.getCode == Status.Code.UNAVAILABLE
+      case e: StatusRuntimeException =>
+        val statusCode: Status.Code = e.getStatus.getCode
+
+        if (List(Status.Code.INTERNAL)

Review Comment:
   Future proofing for adding more statusCodes in the future. A premature optimization :-).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #42818: [SPARK-44835] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42818:
URL: https://github.com/apache/spark/pull/42818#discussion_r1316884962


##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -585,11 +585,39 @@ class SparkConnectClient(object):
 
     @classmethod
     def retry_exception(cls, e: Exception) -> bool:
-        if isinstance(e, grpc.RpcError):
-            return e.code() == grpc.StatusCode.UNAVAILABLE
-        else:
+        """
+        Helper function that is used to identify if an exception thrown by the server
+        can be retried or not.
+
+        Parameters
+        ----------
+        e : Exception
+            The GRPC error as received from the server. Typed as Exception, because other exception
+            thrown during client processing can be passed here as well.
+
+        Returns
+        -------
+        True if the exception can be retried, False otherwise.
+
+        """
+        if not isinstance(e, grpc.RpcError):
             return False
 
+        if e.code() in [
+            grpc.StatusCode.INTERNAL
+        ]:
+            # This error happens if another RPC preempts this RPC, retry.
+            msg_cursor_disconnected = "INVALID_CURSOR.DISCONNECTED"
+
+            msg = str(e)
+            if any(map(lambda x: x in msg, [msg_cursor_disconnected])):
+                return True

Review Comment:
   we have some logic to convert the error messages from random RPC errors to spark understandable exceptions. I'm wondering if we should leverage this here as well.
   
   pyspark.errors.exceptions.connect.convert()



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #42818: [SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #42818: [SPARK-44835][CONNECT] Make INVALID_CURSOR.DISCONNECTED a retriable error
URL: https://github.com/apache/spark/pull/42818


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org