Posted to reviews@spark.apache.org by "WweiL (via GitHub)" <gi...@apache.org> on 2023/08/16 21:26:24 UTC

[GitHub] [spark] WweiL opened a new pull request, #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

WweiL opened a new pull request, #42521:
URL: https://github.com/apache/spark/pull/42521

   # THIS IS STILL A DRAFT
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1296449283


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )
 
 
+def get_idle_event_schema():
+    return StructType(
+        [
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
+            StructField("timestamp", StringType(), False),
+        ]
+    )
+
+
+def get_terminated_event_schema():

Review Comment:
   I'm thinking of just moving these methods to the `QueryxxxEvent` classes, and maybe even creating a method called `asDataFrame`, to save users the effort of creating this themselves. For example, before:
   
   ```
   def get_start_event_schema():
       return StructType(
           [
               StructField("id", StringType(), False),
               StructField("runId", StringType(), False),
               StructField("name", StringType(), True),
               StructField("timestamp", StringType(), False),
           ]
       )
   
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=get_start_event_schema(),
           )
           df.write.saveAsTable("listener_start_events")
   ```
   Note that this only looks simple because I wrote the `asDict` method and the test-only `get_start_event_schema` method. In production, users would need to write this themselves. But these are really redundant, and we could manage them for the user instead:
   
   For example:
   
   ```
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=event.schema(),
           )
           df.write.saveAsTable("listener_start_events")
   
   ============= OR ===============
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = event.asDataFrame(self.spark)
           df.write.saveAsTable("listener_start_events")
   ```
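As a concrete illustration of the proposal above, here is a minimal sketch of what such helpers could look like for a started event. The method names `asDict`, `schema`, and `asDataFrame` come from the proposal in the comment and are not an existing PySpark API; the field layout follows `get_start_event_schema` above, and the class is a simplified stand-in, not the real `QueryStartedEvent`:

```
from pyspark.sql.types import StructType, StructField, StringType


class QueryStartedEventSketch:
    """Simplified stand-in for a listener event carrying id/runId/name/timestamp."""

    def __init__(self, id, runId, name, timestamp):
        self._id, self._runId = id, runId
        self._name, self._timestamp = name, timestamp

    def asDict(self):
        # Keys mirror the schema fields below.
        return {
            "id": str(self._id),
            "runId": str(self._runId),
            "name": self._name,
            "timestamp": self._timestamp,
        }

    @staticmethod
    def schema():
        # Same layout as get_start_event_schema() in the test above.
        return StructType(
            [
                StructField("id", StringType(), False),
                StructField("runId", StringType(), False),
                StructField("name", StringType(), True),
                StructField("timestamp", StringType(), False),
            ]
        )

    def asDataFrame(self, spark):
        # One-row DataFrame, ready to be appended to a table.
        return spark.createDataFrame(data=[self.asDict()], schema=self.schema())
```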



[GitHub] [spark] rangadi commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1298862833


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   Sorry about that. Removed my comment.



[GitHub] [spark] rangadi commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1298858164


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   I think for this one, we can drop the None values in the for loop inside `_invoke_function()`. There is no impact on Protobuf.
   But it's not needed in this PR; it can be done as a follow-up improvement.
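For illustration, the kind of None-filtering being suggested might look like the following generic sketch (the actual `_invoke_function()` internals in PySpark Connect are not shown or assumed here):

```
def drop_none_values(value):
    # Recursively drop None entries so absent optional fields are omitted
    # rather than serialized as explicit nulls.
    if isinstance(value, dict):
        return {k: drop_none_values(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none_values(v) for v in value]
    return value


# Example: the optional "name" field disappears instead of becoming null.
assert drop_none_values({"id": "q-1", "name": None}) == {"id": "q-1"}
```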





[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1296449283


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )
 
 
+def get_idle_event_schema():
+    return StructType(
+        [
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
+            StructField("timestamp", StringType(), False),
+        ]
+    )
+
+
+def get_terminated_event_schema():

Review Comment:
   I'm thinking of just moving these methods to the `QueryxxxEvent` classes, and maybe even creating a method called `asDataFrame`, to save users the effort of creating this themselves. For example, before:
   
   ```
   
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=event.schema(),
           )
           df.write.saveAsTable("listener_start_events")
   ```
   
   After:
   
   ```
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = event.asDataFrame(self.spark)
           df.write.saveAsTable("listener_start_events")
   ```





[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1296452084


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   What do you think? @rangadi  @bogao007 





[GitHub] [spark] bogao007 commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "bogao007 (via GitHub)" <gi...@apache.org>.
bogao007 commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1298847895


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   Yeah, I saw the code below that constructs the schemas; it is a pain... I think it would be good if we could simplify this somehow. It would save a lot of redundant work on both the customer side and our side (writing tests / internal team usage).





[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1300486329


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   Synced with @rangadi; he thinks this is not needed. I'll put everything into the test suite.





[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1300307559


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   @HyukjinKwon 
   I'm looking at the [test error here](https://github.com/WweiL/oss-spark/actions/runs/5884139868/job/15959887335) -- I couldn't reproduce it locally. 
   
   But I think the change is orthogonal to the test error. It's more of an addition to the listener events API. We could just define the `asDict` and `get_event_schema` methods in the test suite, and the test would still run. For example, in current master, the `onQueryStartedEvent` handler is implemented like this:
   https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py#L27-L44
   
   But that would mean users have to add exactly the same redundant code I added in the suite if they want to write the event to an external table. That one does not look too painful, but `onQueryProgress` would be extremely painful because of the complexity of the event schema.
   
   Since it's very likely that every user who wants to write events to external tables would have to redo the same code each time, I'm thinking of providing the API so they don't need to reinvent the wheel.
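For comparison, a sketch of roughly what a user has to hand-write today for just the started event, using the event's public `id`, `runId`, `name`, and `timestamp` properties. The table name is illustrative, and `self.spark` is assumed to be available inside the listener, as in the tests above:

```
from pyspark.sql.streaming.listener import StreamingQueryListener
from pyspark.sql.types import StructType, StructField, StringType


class PersistStartedEvents(StreamingQueryListener):
    def onQueryStarted(self, event):
        # Every field must be flattened and typed by hand; the progress event
        # would need a far larger hand-maintained schema.
        schema = StructType(
            [
                StructField("id", StringType(), False),
                StructField("runId", StringType(), False),
                StructField("name", StringType(), True),
                StructField("timestamp", StringType(), False),
            ]
        )
        row = (str(event.id), str(event.runId), event.name, event.timestamp)
        df = self.spark.createDataFrame([row], schema)
        df.write.mode("append").saveAsTable("listener_start_events")

    def onQueryProgress(self, event):
        pass

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass
```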





[GitHub] [spark] HyukjinKwon commented on pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42521:
URL: https://github.com/apache/spark/pull/42521#issuecomment-1691409733

   Merged to master and branch-3.5.




[GitHub] [spark] WweiL commented on pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on PR #42521:
URL: https://github.com/apache/spark/pull/42521#issuecomment-1692111903

   @LuciferYang Thanks for the ping! Let me check out 3.5 and see.




[GitHub] [spark] WweiL commented on pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on PR #42521:
URL: https://github.com/apache/spark/pull/42521#issuecomment-1692382795

   Created a separate PR for 3.5:
   https://github.com/apache/spark/pull/42664






[GitHub] [spark] dongjoon-hyun commented on pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42521:
URL: https://github.com/apache/spark/pull/42521#issuecomment-1692143698

   This was reverted from branch-3.5 via https://github.com/apache/spark/commit/6c2da61b386d905d05437e68a4b945b5ee9a3e90.
   I'm going to monitor the CIs.






[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1297887012


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   Also cc @HyukjinKwon @HeartSaVioR 





[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1296451667


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   I assume that in Connect mode it would be more common for people to write the listener events directly to tables, because that's probably one of the few ways to access these events in Spark Connect right now.
   
   So I'm thinking of just moving these methods to the `QueryxxxEvent` classes, and maybe even creating a method called `asDataFrame`, to save users the effort of creating this themselves. For example, before:
   
   ```
   def get_start_event_schema():
       return StructType(
           [
               StructField("id", StringType(), False),
               StructField("runId", StringType(), False),
               StructField("name", StringType(), True),
               StructField("timestamp", StringType(), False),
           ]
       )
   
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=get_start_event_schema(),
           )
           df.write.saveAsTable("listener_start_events")
   ```
   Note that this only looks simple because I wrote the `asDict` method and the `get_start_event_schema` method for the test. In production, users would need to write this themselves. But these are really redundant, and we could add a helper method instead:
   
   For example:
   
   ```
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=event.schema(),
           )
           df.write.saveAsTable("listener_start_events")
   
   ============= OR ===============
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = event.asDataFrame(self.spark)
           df.write.saveAsTable("listener_start_events")
   ```





[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1296451667


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   I'm thinking of just move these methods to the `QueryxxxEvent`, maybe even create a method that called `asDataFrame`, to save user's effort to create this by themselves. For example, before:
   
   ```
   def get_start_event_schema():
       return StructType(
           [
               StructField("id", StringType(), False),
               StructField("runId", StringType(), False),
               StructField("name", StringType(), True),
               StructField("timestamp", StringType(), False),
           ]
       )
   
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=get_start_event_schema(),
           )
           df.write.saveAsTable("listener_start_events")
   ```
   Note that this looks simple because I wrote the `asDict` method, and the `get_start_event_schema` method for test. In production, users need to do this themselves. But these are really redundant, if we could add a helper method:
   
   For example:
   
   ```
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=event.schema(),
           )
           df.write.saveAsTable("listener_start_events")
   
   ============= OR ===============
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = event.asDataFrame(self.spark)
           df.write.saveAsTable("listener_start_events")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1296451667


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   In connect mode, this won't work anymore:
   
   ```
   event = None
   
   class MyListener(StreamingQueryListener):
       def onQueryStarted(self, e):
           global event
           event = e
      ....
   
   spark.streams.addListener(MyListener())
   q = spark.readStream.....start()
   q.awaitTermination()
   
   print(event) # Still None on client side, because the code is running on the server
   ```
   
   I assume that in connect mode it would be more common for people to write the listener events directly to tables, because that's probably one of the few ways to access these events in Spark Connect now.
   
   So I'm thinking of just moving these methods to the `QueryxxxEvent` classes, and maybe even adding a method called `asDataFrame`, to save users the effort of building this themselves. For example, before:
   
   ```
   def get_start_event_schema():
       return StructType(
           [
               StructField("id", StringType(), False),
               StructField("runId", StringType(), False),
               StructField("name", StringType(), True),
               StructField("timestamp", StringType(), False),
           ]
       )
   
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=get_start_event_schema(),
           )
           df.write.saveAsTable("listener_start_events")
   ```
   Note that this only looks simple because I wrote the `asDict` method and the `get_start_event_schema` method for the test. In production, users would need to write these themselves, which is redundant boilerplate. We could add a helper method instead:
   
   For example:
   
   ```
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = self.spark.createDataFrame(
               data=[(event.asDict())],
               schema=event.schema(), # if we create an asDict method and a schema method for each event
           )
           df.write.saveAsTable("listener_start_events")
   
   ============= OR ===============
   
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           df = event.asDataFrame(self.spark) # or we can just create the df for the user
           df.write.saveAsTable("listener_start_events")
   ```
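   
   For reference, a minimal sketch of what such helpers might look like on the event class. This is purely illustrative: the `schema()` and `asDataFrame()` names, and the `asDict()` shape, are taken from the proposal in this thread, not from a final API:
   
   ```
   from pyspark.sql.types import StructType, StructField, StringType
   
   class QueryStartedEvent:
       def __init__(self, id, runId, name, timestamp):
           self._id, self._runId = id, runId
           self._name, self._timestamp = name, timestamp
   
       def asDict(self):
           # Same asDict() shape as used in the test above.
           return {
               "id": self._id,
               "runId": self._runId,
               "name": self._name,
               "timestamp": self._timestamp,
           }
   
       def schema(self) -> StructType:
           # Schema matching the fields returned by asDict().
           return StructType(
               [
                   StructField("id", StringType(), False),
                   StructField("runId", StringType(), False),
                   StructField("name", StringType(), True),
                   StructField("timestamp", StringType(), False),
               ]
           )
   
       def asDataFrame(self, spark):
           # Build a one-row DataFrame from this event so a listener can
           # persist it directly, e.g. df.write.saveAsTable(...).
           return spark.createDataFrame(data=[self.asDict()], schema=self.schema())
   ```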



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WweiL commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "WweiL (via GitHub)" <gi...@apache.org>.
WweiL commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1301079336


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -17,40 +17,205 @@
 
 import unittest
 import time
+import uuid
+import json
+from typing import Any, Dict, Union
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+    StateOperatorProgress,
+    StreamingQueryProgress,
+    SourceProgress,
+    SinkProgress,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql import Row
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
+def listener_event_as_dict(
+    e: Union[QueryStartedEvent, QueryProgressEvent, QueryIdleEvent, QueryTerminatedEvent]
+) -> Dict[str, Any]:
+    if isinstance(e, QueryProgressEvent):
+        return {"progress": streaming_query_progress_as_dict(e.progress)}
+    else:
+
+        def conv(obj: Any) -> Any:
+            if isinstance(obj, uuid.UUID):
+                return str(obj)
+            else:
+                return obj
+
+        return {k[1:]: conv(v) for k, v in e.__dict__.items()}
+
+
+def streaming_query_progress_as_dict(e: StreamingQueryProgress) -> Dict[str, Any]:

Review Comment:
   Ah thanks! Never thought of that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bogao007 commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "bogao007 (via GitHub)" <gi...@apache.org>.
bogao007 commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1298859631


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   @rangadi are you referring to [this](https://github.com/apache/spark/pull/42563) PR? I think you commented in the wrong PR :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rangadi commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1298858164


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   [Deleted my comment, it was meant for another PR]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1300995528


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -17,40 +17,205 @@
 
 import unittest
 import time
+import uuid
+import json
+from typing import Any, Dict, Union
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+    StateOperatorProgress,
+    StreamingQueryProgress,
+    SourceProgress,
+    SinkProgress,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql import Row
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
+def listener_event_as_dict(
+    e: Union[QueryStartedEvent, QueryProgressEvent, QueryIdleEvent, QueryTerminatedEvent]
+) -> Dict[str, Any]:
+    if isinstance(e, QueryProgressEvent):
+        return {"progress": streaming_query_progress_as_dict(e.progress)}
+    else:
+
+        def conv(obj: Any) -> Any:
+            if isinstance(obj, uuid.UUID):
+                return str(obj)
+            else:
+                return obj
+
+        return {k[1:]: conv(v) for k, v in e.__dict__.items()}
+
+
+def streaming_query_progress_as_dict(e: StreamingQueryProgress) -> Dict[str, Any]:

Review Comment:
   A simpler way might be `pyspark.cloudpickle.dumps(event)`: save that as a table, load it back, unpickle it via `pyspark.cloudpickle.loads(binary)`, and compare the two.
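   
   A rough sketch of that round-trip, assuming `self.spark` is available inside the listener as in the tests above; the table name and the single-column binary layout are illustrative choices, not from the PR:
   
   ```
   from pyspark import cloudpickle  # PySpark's vendored cloudpickle
   from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
   
   class TestListener(StreamingQueryListener):
       def onQueryStarted(self, event):
           # Pickle the whole event and store it as a one-row binary-column table.
           payload = cloudpickle.dumps(event)
           df = self.spark.createDataFrame([(payload,)], "event binary")
           df.write.saveAsTable("listener_start_events")
   
       def onQueryProgress(self, event):
           pass
   
       def onQueryIdle(self, event):
           pass
   
       def onQueryTerminated(self, event):
           pass
   
   # Later, back on the client side, round-trip and compare:
   # binary = spark.read.table("listener_start_events").collect()[0][0]
   # restored = cloudpickle.loads(bytes(binary))
   # assert isinstance(restored, QueryStartedEvent)
   ```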



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener
URL: https://github.com/apache/spark/pull/42521


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42521:
URL: https://github.com/apache/spark/pull/42521#issuecomment-1691547730

   https://github.com/apache/spark/actions/runs/5962873768/job/16174987432
   
   ```
   Running tests...
   ----------------------------------------------------------------------
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   /__w/spark/spark/python/pyspark/sql/connect/session.py:185: UserWarning: [CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: "spark.connect.execute.reattachable.senderMaxStreamDuration".
   See also 'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'.
     warnings.warn(str(e))
   /__w/spark/spark/python/pyspark/sql/connect/session.py:185: UserWarning: [CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: "spark.connect.execute.reattachable.senderMaxStreamSize".
   See also 'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'.
     warnings.warn(str(e))
   /__w/spark/spark/python/pyspark/sql/connect/session.py:185: UserWarning: [CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: "spark.connect.grpc.binding.port".
   See also 'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'.
     warnings.warn(str(e))
     test_listener_events (pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests) ... Streaming query listener worker is starting with url sc://localhost:43833/;user_id= and sessionId a5a5becc-8da7-4d4b-9a7c-484cd957e3be.
   
   Traceback (most recent call last):
     File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
       return _run_code(code, main_globals, None,
     File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
       exec(code, run_globals)
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/connect/streaming/worker/listener_worker.py", line 99, in <module>
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/connect/streaming/worker/listener_worker.py", line 86, in main
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/connect/streaming/worker/listener_worker.py", line 77, in process
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/streaming/listener.py", line 251, in fromJson
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/streaming/listener.py", line 480, in fromJson
   KeyError: 'batchDuration'
   
   ERROR (46.372s)
   
   ======================================================================
   ERROR [46.372s]: test_listener_events (pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/spark/spark/python/pyspark/sql/tests/connect/streaming/test_parity_listener.py", line 80, in test_listener_events
       self.spark.read.table("listener_progress_events").collect()[0][0]
     File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 1645, in collect
       table, schema = self._session.client.to_table(query)
     File "/__w/spark/spark/python/pyspark/sql/connect/client/core.py", line 833, in to_table
       table, schema, _, _, _ = self._execute_and_fetch(req)
     File "/__w/spark/spark/python/pyspark/sql/connect/client/core.py", line 1257, in _execute_and_fetch
       for response in self._execute_and_fetch_as_iterator(req):
     File "/__w/spark/spark/python/pyspark/sql/connect/client/core.py", line 1238, in _execute_and_fetch_as_iterator
       self._handle_error(error)
     File "/__w/spark/spark/python/pyspark/sql/connect/client/core.py", line 1477, in _handle_error
       self._handle_rpc_error(error)
     File "/__w/spark/spark/python/pyspark/sql/connect/client/core.py", line 1513, in _handle_rpc_error
       raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `listener_progress_events` cannot be found. Verify the spelling and correctness of the schema and catalog.
   If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
   To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.;
   'UnresolvedRelation [listener_progress_events], [], false
   
   
   ----------------------------------------------------------------------
   Ran 1 test in 55.000s
   
   FAILED (errors=1)
   
   Generating XML reports...
   Generated XML report: target/test-reports/TEST-pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests-20230824115154.xml
   
   Had test failures in pyspark.sql.tests.connect.streaming.test_parity_listener with python3.9; see logs.
   Error:  running /__w/spark/spark/python/run-tests --modules=pyspark-connect --parallelism=1 ; received return code 255
   Error: Process completed with exit code 19.
   ```
   
   @WweiL Are there any related PRs that have not been merged into branch-3.5? The branch-3.5 daily test failed today.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #42521: [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42521:
URL: https://github.com/apache/spark/pull/42521#issuecomment-1692141836

   Hi, all. I also saw consecutive failures in the three commits after this one. Let me revert this from branch-3.5 first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42521: [SPARK-44435][SS][CONNECT][DRAFT] Tests for foreachBatch and Listener

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #42521:
URL: https://github.com/apache/spark/pull/42521#discussion_r1299493549


##########
python/pyspark/sql/tests/connect/streaming/test_parity_listener.py:
##########
@@ -19,38 +19,153 @@
 import time
 
 from pyspark.sql.tests.streaming.test_streaming_listener import StreamingListenerTestsMixin
-from pyspark.sql.streaming.listener import StreamingQueryListener, QueryStartedEvent
-from pyspark.sql.types import StructType, StructField, StringType
+from pyspark.sql.streaming.listener import (
+    StreamingQueryListener,
+    QueryStartedEvent,
+    QueryProgressEvent,
+    QueryIdleEvent,
+    QueryTerminatedEvent,
+)
+from pyspark.sql.types import (
+    ArrayType,
+    StructType,
+    StructField,
+    StringType,
+    IntegerType,
+    FloatType,
+    MapType,
+)
+from pyspark.sql.functions import count, lit
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 def get_start_event_schema():
     return StructType(
         [
-            StructField("id", StringType(), True),
-            StructField("runId", StringType(), True),
+            StructField("id", StringType(), False),
+            StructField("runId", StringType(), False),
             StructField("name", StringType(), True),
-            StructField("timestamp", StringType(), True),
+            StructField("timestamp", StringType(), False),
         ]
     )

Review Comment:
   Yeah, we can change it like that - I guess that's the only way to fix the test?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org