You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "juliuszsompolski (via GitHub)" <gi...@apache.org> on 2023/11/30 20:37:02 UTC

[PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

juliuszsompolski opened a new pull request, #44095:
URL: https://github.com/apache/spark/pull/44095

   ### What changes were proposed in this pull request?
   
   A race condition sending interrupt (or releaseSession) just after execute could cause:
   ```
   [info]   org.apache.spark.SparkException: java.lang.IllegalStateException: 
   [info]         operationId: fa653330-587a-41c7-a4a9-7987f070dcba with status Started
   [info]         is not within statuses List(Finished, Failed, Canceled) for event Closed
   ```
   and
   ```
   [info]   org.apache.spark.SparkException: java.lang.IllegalStateException: 
   [info]         operationId: fa653330-587a-41c7-a4a9-7987f070dcba with status Closed
   [info]         is not within statuses List(Started, Analyzed, ReadyForExecution, Finished, Failed) for event Canceled
   ```
   when the interrupt arrives before the thread in ExecuteThreadRunner is started.
   
   This would cause in ExecuteHolder close:
   ```
         runner.interrupt() <- just sets interrupted = true
         runner.join() <- thread didn't start yet, so join() returns immediately, doesn't wait for thread to be actually interrupted
         ...
         eventsManager.postClosed() <- causes the first failure, because thread wasn't running and didn't move to Canceled
   ```
   Afterwards, assuming we allowed the transition, the thread will get started, and then want to immediately move to Canceled, notice the `interrupted` state. Then it would hit the 2nd error, not allowing Canceled after Closed.
   
   While we could consider allowing the first transition (Started -> Closed), we don't want any events to be coming after Closed, so that listeners can clean their state after Closed.
   
   Fix is to handle interrupts coming before the thread started, and then prevent the thread from even starting if it was interruped.
   
   ### Why are the changes needed?
   
   This was detected after grpc 1.56 to 1.59 upgrade and causes some tests in SparkConnectServiceE2ESuite and ReattachableExecuteSuite to be flaky.
   With the grpc upgrade, execute is eagerly sent to the server, and in some test we cleanup and release the session without waiting for the execution to start. This has triggered this flakiness. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Test added. The test depends on timing, so may not fail reliably but only from time to time.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Github Copilot was assisting in some boilerplate auto-completion.
   
   Generated-by: Github Copilot


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on PR #44095:
URL: https://github.com/apache/spark/pull/44095#issuecomment-1834527580

   cc @hvanhovell @grundprinzip  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on PR #44095:
URL: https://github.com/apache/spark/pull/44095#issuecomment-1839107831

   I am aware of some flakiness in the added tests, I will followup tomorrow to make it stable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44095:
URL: https://github.com/apache/spark/pull/44095#issuecomment-1838227946

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on code in PR #44095:
URL: https://github.com/apache/spark/pull/44095#discussion_r1411243350


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/ExecuteHolder.scala:
##########
@@ -128,13 +128,6 @@ private[connect] class ExecuteHolder(
     runner.start()
   }
 
-  /**
-   * Wait for the execution thread to finish and join it.
-   */
-  def join(): Unit = {
-    runner.join()
-  }

Review Comment:
   wasn't used, and should be internal, let's reduce the surface



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44095: [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started
URL: https://github.com/apache/spark/pull/44095


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on code in PR #44095:
URL: https://github.com/apache/spark/pull/44095#discussion_r1411245848


##########
connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala:
##########
@@ -296,6 +297,32 @@ class ReattachableExecuteSuite extends SparkConnectServerTest {
     }
   }
 
+  test("SPARK-46186 interrupt directly after query start") {
+    // This test depends on fast timing.
+    // If something is wrong, it can fail only from time to time.
+    withRawBlockingStub { stub =>
+      val operationId = UUID.randomUUID().toString
+      val interruptRequest = proto.InterruptRequest
+        .newBuilder
+        .setUserContext(userContext)
+        .setSessionId(defaultSessionId)
+        .setInterruptType(proto.InterruptRequest.InterruptType.INTERRUPT_TYPE_OPERATION_ID)
+        .setOperationId(operationId)
+        .build()
+      val iter = stub.executePlan(
+        buildExecutePlanRequest(buildPlan(MEDIUM_RESULTS_QUERY), operationId = operationId))
+      // wait for execute holder to exist, but the execute thread may not have started yet.

Review Comment:
   if we send it *too* fast, before the exeuction is even created, the interrupt could be a noop, and the query will start running uninterrupted.
   Interrupt returns in interrupt response that it didn't really interrupt any operation, but currently the clients don't implement checking this... I think trying to fix that further gets into really edge cases ...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46186][CONNECT] Fix illegal state transition when ExecuteThreadRunner interrupted before started [spark]

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on PR #44095:
URL: https://github.com/apache/spark/pull/44095#issuecomment-1838172064

   gentle ping @hvanhovell @grundprinzip 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org