Posted to reviews@spark.apache.org by "WeichenXu123 (via GitHub)" <gi...@apache.org> on 2024/01/03 02:44:35 UTC

Re: [PR] [SPARK-46361][PYTHON][CORE] Spark dataset chunk read api (developer API) [spark]

WeichenXu123 commented on code in PR #44294:
URL: https://github.com/apache/spark/pull/44294#discussion_r1440018036


##########
core/src/main/scala/org/apache/spark/SparkEnv.scala:
##########
@@ -99,6 +99,10 @@ class SparkEnv (
 
   private[spark] var executorBackend: Option[ExecutorBackend] = None
 
+  private[spark] var cachedArrowBatchServerPort: Option[Int] = None
+
+  private[spark] var cachedArrowBatchServerSecret: Option[String] = None

Review Comment:
   I am considering adding an API like:
   
   ```
   # 1. The user calls this developer API in a pyspark UDF
   # to start an Arrow stream server on the local executor.
   server_port, server_secret = startChunkServer()
   
   # 2. Read chunk data using the server created above.
   # The user can call this function in a pyspark UDF or in
   # descendant processes of the pyspark UDF.
   readChunk(chunk_id, server_port, server_secret)
   
   # 3. Shut down the server created above.
   shutdownChunkServer(server_port, server_secret)
   ```
   
   so that we can avoid each executor launching a long-running server.
   https://docs.google.com/document/d/1qs8lKQ3IwF5QGGAaa6OIiXYhdG4_HJtS66dswtx9kd0/edit#bookmark=id.f6cwxc97g3ig
   
   Then we wouldn't need these variables.
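   
   A rough sketch of how this could look from the UDF side (the three chunk-server functions are only the ones proposed above; the `chunk_id` column and the pandas-iterator wrapper are illustrative assumptions, not part of this PR):
   
   ```
   # Hypothetical usage sketch: startChunkServer / readChunk /
   # shutdownChunkServer are the proposed developer API, not an
   # existing pyspark API, and their import location is unspecified.
   def read_chunks(batch_iter):
       # 1. Start the Arrow stream server on this executor for the task.
       server_port, server_secret = startChunkServer()
       try:
           for batch in batch_iter:
               for chunk_id in batch["chunk_id"]:
                   # 2. Fetch the cached Arrow data for this chunk id
                   # (return type assumed to be a pyarrow.Table here).
                   table = readChunk(chunk_id, server_port, server_secret)
                   yield table.to_pandas()
       finally:
           # 3. Shut the server down so no long-running server is left
           # behind on the executor.
           shutdownChunkServer(server_port, server_secret)
   ```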



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

