Posted to reviews@spark.apache.org by "hvanhovell (via GitHub)" <gi...@apache.org> on 2023/08/14 02:39:52 UTC

[GitHub] [spark] hvanhovell opened a new pull request, #42476: [SPARK-44794][CONNECT] Make Streaming Queries work with REPL generated classes.

hvanhovell opened a new pull request, #42476:
URL: https://github.com/apache/spark/pull/42476

   ### What changes were proposed in this pull request?
   When you try to run a streaming query from the REPL, for example:
   ```scala
   val add1 = udf((i: Long) => i + 1)
   val query = spark.readStream
       .format("rate")
       .option("rowsPerSecond", "10")
       .option("numPartitions", "1")
       .load()
       .withColumn("value", add1($"value"))
       .writeStream
       .format("memory")
       .queryName("my_sink")
       .start()
   ```
   You are currently greeted by a hard-to-understand deserialization error, in which a serialization proxy cannot be assigned to a field. The underlying cause is a `ClassNotFoundException` (yes, Java serialization is weird). This `ClassNotFoundException` occurs because we do not properly propagate the `JobArtifactState` to the streaming query execution thread. That state indirectly contains the location of REPL-generated classes and session-local libraries.
   
   This PR fixes this by propagating the `JobArtifactState` into the stream execution thread.
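
   The propagation pattern can be sketched in isolation. This is a minimal illustration that assumes the state lives in a plain, non-inheritable thread-local; `ArtifactState`, `ArtifactContext`, and `PropagationDemo` are hypothetical stand-ins and not Spark classes. In the actual change, the wrapper used on the stream execution thread is `JobArtifactSet.withActiveJobArtifactState`.

   ```scala
   import java.util.concurrent.atomic.AtomicReference

   // Hypothetical stand-in for the per-session artifact state (not Spark's class).
   final case class ArtifactState(sessionId: String, replClassUri: Option[String])

   // Hypothetical holder that keeps the active state in a thread-local.
   object ArtifactContext {
     // Not inheritable: a newly started thread begins with None.
     private val current = new ThreadLocal[Option[ArtifactState]] {
       override def initialValue(): Option[ArtifactState] = None
     }

     def active: Option[ArtifactState] = current.get()

     // Install `state` while `body` runs, then restore the previous value.
     def withState[T](state: Option[ArtifactState])(body: => T): T = {
       val previous = current.get()
       current.set(state)
       try body finally current.set(previous)
     }
   }

   object PropagationDemo {
     // Evaluate `body` on a freshly started thread and return its result.
     private def runOnNewThread[T](body: => T): T = {
       val result = new AtomicReference[Option[T]](None)
       val thread = new Thread(() => result.set(Some(body)))
       thread.start()
       thread.join()
       result.get().get
     }

     // Without propagation, the new thread does not see the caller's state.
     def seenWithoutPropagation(state: ArtifactState): Option[ArtifactState] =
       ArtifactContext.withState(Some(state)) {
         runOnNewThread(ArtifactContext.active)
       }

     // With propagation: capture on the caller thread, re-install on the new thread.
     def seenWithPropagation(state: ArtifactState): Option[ArtifactState] =
       ArtifactContext.withState(Some(state)) {
         val captured = ArtifactContext.active
         runOnNewThread {
           ArtifactContext.withState(captured)(ArtifactContext.active)
         }
       }
   }
   ```

   In this sketch the state is captured while still on the calling thread and re-installed as the first step on the new thread, so everything that runs inside the wrapped block can resolve the session's classes.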
   
   
   ### Why are the changes needed?
   It is a bug. We want streaming to work with Connect's isolated dependencies.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   I added a test to `ReplE2ESuite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on a diff in pull request #42476: [SPARK-44794][CONNECT] Make Streaming Queries work with Connect's artifact management

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #42476:
URL: https://github.com/apache/spark/pull/42476#discussion_r1293766120


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala:
##########
@@ -204,7 +207,9 @@ abstract class StreamExecution(
         // To fix call site like "run at <unknown>:0", we bridge the call site from the caller
         // thread to this micro batch thread
         sparkSession.sparkContext.setCallSite(callSite)
-        runStream()
+        JobArtifactSet.withActiveJobArtifactState(jobArtifactState) {

Review Comment:
   FYI: @JerryLead, @huanliwang-db: This is based on thread-locals. Will it be OK if multiple micro-batches are active on multiple threads?





[GitHub] [spark] hvanhovell commented on pull request #42476: [SPARK-44794][CONNECT] Make Streaming Queries work with REPL generated classes.

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #42476:
URL: https://github.com/apache/spark/pull/42476#issuecomment-1676591099

   @bogao007 PTAL




[GitHub] [spark] hvanhovell commented on pull request #42476: [SPARK-44794][CONNECT] Make Streaming Queries work with Connect's artifact management

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #42476:
URL: https://github.com/apache/spark/pull/42476#issuecomment-1679298833

   I am merging this one. The single failing test seems to be flaky.




[GitHub] [spark] WweiL commented on pull request #42476: [SPARK-44794][CONNECT] Make Streaming Queries work with Connect's artifact management

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on PR #42476:
URL: https://github.com/apache/spark/pull/42476#issuecomment-1677784730

   Thanks for the fix!




[GitHub] [spark] hvanhovell closed pull request #42476: [SPARK-44794][CONNECT] Make Streaming Queries work with Connect's artifact management

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell closed pull request #42476: [SPARK-44794][CONNECT] Make Streaming Queries work with Connect's artifact management
URL: https://github.com/apache/spark/pull/42476

