Posted to commits@spark.apache.org by gu...@apache.org on 2022/11/11 12:18:22 UTC

[spark] branch master updated: [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 4f614b3f699 [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`
4f614b3f699 is described below

commit 4f614b3f699d4d3924d4411c98a20d2e58b2e2e6
Author: Ruifeng Zheng <ru...@apache.org>
AuthorDate: Fri Nov 11 21:18:09 2022 +0900

    [SPARK-41005][CONNECT][FOLLOWUP] Collect should use `submitJob` instead of `runJob`
    
    ### What changes were proposed in this pull request?
    Use `submitJob` instead of `runJob`.
    
    ### Why are the changes needed?
    `spark.sparkContext.runJob` blocks until all partitions have finished, so no result can be streamed back to the client before the whole job completes.
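
    For context, a minimal sketch of the two calls (`batches`,
    `processPartition`, `resultHandler` and `numPartitions` are the same
    values used in the diff below; the signatures are those declared on
    `org.apache.spark.SparkContext`):

        // Blocking: runJob returns only after resultHandler has been
        // invoked for every partition.
        spark.sparkContext.runJob(batches, processPartition, resultHandler)

        // Non-blocking: submitJob schedules the same job and returns a
        // SimpleFutureAction immediately, so the caller can start
        // streaming partition 0 while later partitions are still running.
        spark.sparkContext.submitJob(
          rdd = batches,
          processPartition = processPartition,
          partitions = Seq.range(0, numPartitions),
          resultHandler = resultHandler,
          resultFunc = () => ())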
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing tests
    
    Closes #38614 from zhengruifeng/connect_collect_submitJob.
    
    Authored-by: Ruifeng Zheng <ru...@apache.org>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 .../sql/connect/service/SparkConnectStreamHandler.scala   | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala b/connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala
index ffac330cd6d..55e091bd8d0 100644
--- a/connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala
+++ b/connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala
@@ -161,17 +161,24 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[Response]) exte
           ()
         }
 
-        spark.sparkContext.runJob(batches, processPartition, resultHandler)
+        spark.sparkContext.submitJob(
+          rdd = batches,
+          processPartition = processPartition,
+          partitions = Seq.range(0, numPartitions),
+          resultHandler = resultHandler,
+          resultFunc = () => ())
 
         // The main thread will wait until the 0-th partition is available,
-        // then send it to client and wait for next partition.
+        // then send it to the client and wait for the next partition.
         var currentPartitionId = 0
         while (currentPartitionId < numPartitions) {
           val partition = signal.synchronized {
-            while (!partitions.contains(currentPartitionId)) {
+            var result = partitions.remove(currentPartitionId)
+            while (result.isEmpty) {
               signal.wait()
+              result = partitions.remove(currentPartitionId)
             }
-            partitions.remove(currentPartitionId).get
+            result.get
           }
 
           partition.foreach { case (bytes, count) =>

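The hunk pairs `submitJob` with a wait/notify handoff: the `resultHandler`
defined just above the hunk (not shown here) is expected to store each
finished partition in the shared `partitions` map and notify on `signal`,
while the main thread removes and streams partitions strictly in order.
Below is a self-contained sketch of that pattern; every name in it is
illustrative, not the actual SparkConnectStreamHandler code:

    import scala.collection.mutable

    // Sketch of the ordered wait/notify handoff used in the patch.
    object OrderedHandoff {
      def main(args: Array[String]): Unit = {
        val numPartitions = 4
        val signal = new Object
        val partitions = mutable.Map.empty[Int, String]

        // Producer side: what a resultHandler does as partitions
        // finish, possibly out of order.
        def resultHandler(partitionId: Int, result: String): Unit =
          signal.synchronized {
            partitions(partitionId) = result
            signal.notifyAll() // wake the waiting consumer
          }

        // Simulate partitions completing out of order.
        Seq(2, 0, 3, 1).foreach { id =>
          new Thread(() => resultHandler(id, s"result-$id")).start()
        }

        // Consumer side: emit results strictly in partition order;
        // remove() takes each buffered partition out of the map so it
        // can be garbage-collected as soon as it has been sent.
        var currentPartitionId = 0
        while (currentPartitionId < numPartitions) {
          val result = signal.synchronized {
            var r = partitions.remove(currentPartitionId)
            while (r.isEmpty) {
              signal.wait()
              r = partitions.remove(currentPartitionId)
            }
            r.get
          }
          println(s"sending $result")
          currentPartitionId += 1
        }
      }
    }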

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org