Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/02 06:33:06 UTC

[GitHub] [spark] aokolnychyi opened a new pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

aokolnychyi opened a new pull request #31700:
URL: https://github.com/apache/spark/pull/31700


   
   ### What changes were proposed in this pull request?
   
   This PR shows one potential way to support required distribution and ordering in SS, which is marked as a blocker for 3.2.
   
   **Important!** This change does not try to address other relevant TODO items like refreshing the cache or performing more checks using the capability API.
   
   ### Why are the changes needed?
   
   These changes are needed so that data sources can request a specific distribution and ordering not only for batch but also for micro-batch writes.
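
To make the request concrete, here is a minimal sketch in plain Scala of what "requesting a distribution and ordering" could mean for a single batch or micro-batch of rows. Everything in it (`Row`, `WriteRequirements`, `prepare`) is a hypothetical stand-in invented for illustration; it is not Spark's connector API and not the code in this PR.

```scala
// Simplified model only: these types are hypothetical stand-ins, not Spark classes.
object WriteRequirementsSketch {
  final case class Row(region: String, id: Int)

  // What a sink might declare: cluster rows by a key, then sort within each partition.
  final case class WriteRequirements(
      clusterBy: Row => String,
      ordering: Ordering[Row],
      numPartitions: Int)

  // Arrange one batch (or one micro-batch) of rows so that it satisfies the requirements.
  def prepare(rows: Seq[Row], req: WriteRequirements): Vector[Vector[Row]] = {
    // Rows with the same clustering key always land in the same partition.
    val byPartition = rows.groupBy { r =>
      Math.floorMod(req.clusterBy(r).hashCode, req.numPartitions)
    }
    // Each partition is sorted independently; there is no global order.
    Vector.tabulate(req.numPartitions) { p =>
      byPartition.getOrElse(p, Seq.empty).sorted(req.ordering).toVector
    }
  }
}
```

Because each micro-batch is a bounded collection, the same `prepare` step could in principle run once per micro-batch, which is the situation the description above refers to.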
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   This PR adds additional tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788659440


   I'd appreciate your feedback, @tdas @jose-torres @brkyvz.




[GitHub] [spark] HeartSaVioR commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r587970348



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##########
@@ -166,6 +166,52 @@ object OverwritePartitionsDynamic {
   }
 }
 
+case class AppendMicroBatch(

Review comment:
       The streaming write semantic is not the same as the batch one. The semantic is bound to the stateful operation; there should be only `append`, `update` (not the same as overwrite), and `truncate and append (complete)`. For update, we haven't yet constructed a proper way to define it.
   
   The major concern is that the group keys in a stateful operation must be used as keys in update mode. That is currently not possible, so Spark has been handling update with the huge risk that it does the same as append, and the risk is delegated to the sink (or user). The sink or user has to deal with reflecting the appended output as an "upsert". That's why I renamed `SupportsStreamingUpdate` to `SupportsStreamingUpdateAsAppend` to clarify the behavior.
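
The "update as append" behavior described in this comment can be illustrated with a small plain-Scala sketch. The names `Output` and `UpsertingSink` are hypothetical and exist only for this illustration; the sketch assumes the sink somehow already knows which column is the group key, which is exactly the piece the comment says Spark cannot communicate today.

```scala
// Illustrative model only: not Spark code; all names are hypothetical.
object UpdateAsAppendSketch {
  // One output row of a stateful aggregation: the group key and its latest value.
  final case class Output(groupKey: String, count: Long)

  // The engine only ever appends rows, so the sink has to apply upsert semantics itself.
  final class UpsertingSink {
    private val table = scala.collection.mutable.LinkedHashMap.empty[String, Long]

    // Called once per micro-batch with the rows the engine "appended".
    def appendBatch(rows: Seq[Output]): Unit =
      rows.foreach(r => table.update(r.groupKey, r.count)) // last write per key wins

    def snapshot: Map[String, Long] = table.toMap
  }
}
```

If the sink instead stored every appended row, the key `"a"` in a two-batch run would appear twice with different counts, which is the risk the comment says is delegated to the sink or user.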






[GitHub] [spark] aokolnychyi commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r585299930



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -474,6 +537,33 @@ class WriteDistributionAndOrderingSuite
   }
 
   private def checkWriteRequirements(
+      tableDistribution: Distribution,
+      tableOrdering: Array[SortOrder],
+      expectedWritePartitioning: physical.Partitioning,
+      expectedWriteOrdering: Seq[catalyst.expressions.SortOrder],
+      writeTransform: DataFrame => DataFrame = df => df,
+      writeCommand: String): Unit = {
+
+    if (writeCommand.startsWith(microBatchPrefix)) {

Review comment:
       I wrote it this way to minimize changes.






[GitHub] [spark] SparkQA commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788675731


   **[Test build #135629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135629/testReport)** for PR 31700 at commit [`6581d7b`](https://github.com/apache/spark/commit/6581d7b632c5615664e8bf035bf867fae8c72b4d).




[GitHub] [spark] SparkQA commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788684771


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40209/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r587452749



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##########
@@ -166,6 +166,52 @@ object OverwritePartitionsDynamic {
   }
 }
 
+case class AppendMicroBatch(

Review comment:
       I don't think we have finalized the streaming writing semantic yet. Ideally, it should be similar to batch (upsert semantic), but we are not at that point yet.
   
   I think we should standardize the streaming writing query plan later. For now, let's just use one query plan with `OutputMode` as the parameter.
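
The suggestion of one query plan with the output mode as a parameter might look roughly like the following plain-Scala sketch. The `OutputMode` values and the `StreamingWrite` node here are simplified stand-ins defined locally for illustration; they are not Spark's `OutputMode` class or any plan node from this PR.

```scala
// Illustrative model only: local stand-ins, not Spark's OutputMode or logical plans.
object SinglePlanSketch {
  sealed trait OutputMode
  case object Append extends OutputMode
  case object Update extends OutputMode
  case object Complete extends OutputMode

  // One plan node for all streaming writes; the mode is data, not a separate class.
  final case class StreamingWrite(table: String, mode: OutputMode, batchId: Long)

  // Mode-specific behavior is decided where the plan is executed.
  def describe(w: StreamingWrite): String = w.mode match {
    case Append   => s"append batch ${w.batchId} to ${w.table}"
    case Update   => s"append batch ${w.batchId} to ${w.table} (sink applies upsert)"
    case Complete => s"truncate ${w.table}, then append batch ${w.batchId}"
  }
}
```

The trade-off in this shape is that adding or redefining a mode later touches only the `match`, not the set of plan classes.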






[GitHub] [spark] cloud-fan commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790605892


   Micro-batch is an implementation detail and shouldn't be used to define the semantic.
   
   If a streaming sink has a distribution requirement, I can understand it and the data should be properly partitioned before entering the streaming sink. However, if a streaming sink has an ordering requirement, I can't imagine how a data stream can satisfy an ordering requirement.






[GitHub] [spark] AmplabJenkins commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790011236


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40301/
   




[GitHub] [spark] SparkQA commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790129959


   **[Test build #135719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135719/testReport)** for PR 31700 at commit [`e7f167f`](https://github.com/apache/spark/commit/e7f167fb0103d7e62127421cf5ad2a594d5f4337).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788696041


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135629/
   




[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788658247


   This PR needs more refinement, but I would like to get some early feedback from folks working on SS before investing more time. There are quite a few TODO items around the capability API, so I am not sure what the original plan was. 
   
   I guess this covers SPARK-27484.




[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790940796


   If there is no clarity on what individual plans should look like in SS, I can adapt the existing `WriteToMicroBatchDataSource`.




[GitHub] [spark] aokolnychyi commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r588010745



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##########
@@ -166,6 +166,52 @@ object OverwritePartitionsDynamic {
   }
 }
 
+case class AppendMicroBatch(

Review comment:
       Thanks for the context, @HeartSaVioR!
   
   I think we have two ways to proceed:
   
   Option 1: Just adapt `WriteToMicroBatchDataSource` to use the `Write` abstraction and handle it in `V2Writes`.
   Option 2: Define specific plans where we have clarity. For example, `append` and `complete` seem well-defined. We could define plans like `AppendStreamingData` and `TruncateAndAppendStreamingData` or anything like that and have something intermediate for `update`.
   
   I am fine either way but option 1 seems easier for this PR. The rest can be covered by SPARK-27484.
   
   To start with, we should all agree this feature is useful for micro-batch streaming.
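
As a rough picture of what Option 2 could look like, the sketch below models dedicated plans for the well-defined modes plus an intermediate one for `update`. `AppendStreamingData` and `TruncateAndAppendStreamingData` are the names floated in the comment; `UpdateAsAppendStreamingData` and `planFor` are hypothetical names added here for illustration. None of this is real Spark code.

```scala
// Illustrative model only: plain case classes, not Spark logical plans.
object SpecificPlansSketch {
  sealed trait StreamingWritePlan { def table: String }
  final case class AppendStreamingData(table: String) extends StreamingWritePlan
  final case class TruncateAndAppendStreamingData(table: String) extends StreamingWritePlan
  // Hypothetical intermediate plan while update semantics remain undefined.
  final case class UpdateAsAppendStreamingData(table: String) extends StreamingWritePlan

  // Pick a dedicated plan per output mode; unknown modes are rejected explicitly.
  def planFor(outputMode: String, table: String): StreamingWritePlan = outputMode match {
    case "append"   => AppendStreamingData(table)
    case "complete" => TruncateAndAppendStreamingData(table)
    case "update"   => UpdateAsAppendStreamingData(table)
    case other      => throw new IllegalArgumentException(s"Unsupported output mode: $other")
  }
}
```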






[GitHub] [spark] aokolnychyi commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r585299686



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
##########
@@ -566,7 +580,9 @@ class MicroBatchExecution(
     val triggerLogicalPlan = sink match {
       case _: Sink => newAttributePlan
       case _: SupportsWrite =>
-        newAttributePlan.asInstanceOf[WriteToMicroBatchDataSource].createPlan(currentBatchId)
+        newAttributePlan transform {
+          case w: V2MicroBatchWriteCommand => w.withNewBatchId(currentBatchId)

Review comment:
       Seems safe?
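
One way to read the change in this hunk: the removed line casts the whole plan and so assumes the write node is the root, while the added lines rewrite any matching node wherever it sits and leave everything else untouched. The plain-Scala sketch below models that with a tiny tree. `Plan`, `Source`, `Project`, and `MicroBatchWrite` are hypothetical stand-ins, and this `transform` is a simplified top-down traversal written for the example, not Spark's `TreeNode` implementation.

```scala
// Illustrative model only: a toy plan tree, not Spark's TreeNode/LogicalPlan.
object BatchIdTransformSketch {
  sealed trait Plan {
    // Apply the rule to this node if it matches, then recurse into the children.
    def transform(rule: PartialFunction[Plan, Plan]): Plan =
      rule.applyOrElse(this, identity[Plan]) match {
        case Project(cols, child)      => Project(cols, child.transform(rule))
        case MicroBatchWrite(t, id, q) => MicroBatchWrite(t, id, q.transform(rule))
        case leaf                      => leaf
      }
  }
  final case class Source(name: String) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan
  final case class MicroBatchWrite(table: String, batchId: Long, query: Plan) extends Plan {
    def withNewBatchId(id: Long): MicroBatchWrite = copy(batchId = id)
  }
}
```

With this model, `plan.transform { case w: MicroBatchWrite => w.withNewBatchId(42L) }` updates the write node and is a no-op on a plan that contains none, whereas a cast on such a plan would fail at runtime.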






[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790180126


   All tests are green so this PR is ready for the first review round.




[GitHub] [spark] AmplabJenkins commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788696041


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135629/
   






[GitHub] [spark] AmplabJenkins commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788694799


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40209/
   




[GitHub] [spark] HeartSaVioR edited a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791149996


   Actually, that's one of the few advantages of micro-batch compared to record-to-record processing, and we already leverage it in some public APIs (e.g. flatMapGroupsWithState, which "sorts" the inputs within a specific micro-batch so that values from the same group can be served to the user function sequentially, wrapped in an iterator). 
   
   That said, I'm supportive of the concept of ordering, but only for micro-batch. Dealing with sort in continuous mode is quite tricky: due to the nature of record-to-record processing, sort requires buffering inputs into state or somewhere in memory until the epoch has finished (though we can keep the state or buffer sorted), and downstream operations can only continue their work after that, which contradicts the fact that the epoch is finished.
   
   My 2 cents on continuous mode is that we'd be better off admitting the architectural differences between the batch-oriented and streaming-oriented modes, and trying to find some safe approach to isolate the two. Naturally integrating the two sounds very hard to achieve, and has even been acting as a roadblock for improving functionality in micro-batch mode as well.
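
The micro-batch advantage described in this comment can be sketched in plain Scala: because a micro-batch is bounded, it can be sorted by key and each group handed to a user function as an iterator. `Event` and `processBatch` are hypothetical names for this illustration only; the sketch is not how Spark implements flatMapGroupsWithState.

```scala
// Illustrative model only: not Spark's implementation; names are hypothetical.
object MicroBatchGroupingSketch {
  final case class Event(key: String, value: Int)

  // Sort one bounded micro-batch by key, then hand each key's values to `f` as an iterator.
  def processBatch[R](batch: Seq[Event])(f: (String, Iterator[Int]) => R): Vector[R] = {
    val sorted = batch.sortBy(_.key) // stable sort: arrival order is kept within a key
    val out = Vector.newBuilder[R]
    var rest = sorted
    while (rest.nonEmpty) {
      val key = rest.head.key
      // After sorting, all rows for `key` are contiguous at the front of `rest`.
      val (group, tail) = rest.span(_.key == key)
      out += f(key, group.iterator.map(_.value))
      rest = tail
    }
    out.result()
  }
}
```

A record-at-a-time engine has no bounded `batch` to pass in, so it would first have to buffer an epoch's worth of input before the `sortBy` step could run, which is the difficulty raised for continuous mode.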




[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788656950


   cc @dongjoon-hyun @viirya @sunchao @cloud-fan @HyukjinKwon @HeartSaVioR @holdenk @huaxingao @xuanyuanking
   




[GitHub] [spark] AmplabJenkins commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790130907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135719/
   




[GitHub] [spark] github-actions[bot] closed pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #31700:
URL: https://github.com/apache/spark/pull/31700


   




[GitHub] [spark] SparkQA commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-789983065


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40301/
   




[GitHub] [spark] aokolnychyi edited a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788656950


   cc @dongjoon-hyun @viirya @sunchao @cloud-fan @HyukjinKwon @rdblue @HeartSaVioR @holdenk @huaxingao @xuanyuanking
   




[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790935772


   @cloud-fan, I am not sure about the continuous mode, but I think there is a valid use case for micro-batch streaming. The required distribution and ordering apply to individual writes, so they do not imply that the underlying sink is globally ordered.
   
   For example, let's say we are writing to a partitioned file sink. If we just group incoming data by partition, a single output task may still have records for multiple partitions. A naive sink implementation may close the current file and open a new one each time it sees records for another partition, producing a large number of files. An alternative implementation can keep multiple files open. That's not ideal either, as it increases memory consumption. That's why ordering data within a task by partition seems like a good default for micro-batch streaming.
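
The file-count argument above can be illustrated with a small toy model. This is only a sketch and not Spark code: `Record`, `filesOpened`, and the sample partition values are hypothetical names introduced for illustration, and the "naive" writer is assumed to open one file per run of consecutive records sharing a partition value.

```scala
// Toy model (not Spark code) of a naive partitioned file writer.
object NaiveSinkSketch {
  // Hypothetical record type: a partition value plus a payload.
  final case class Record(partition: String, payload: Int)

  // Number of files the naive writer opens for one task's input:
  // it closes the current file and opens a new one whenever the
  // partition value changes between consecutive records.
  def filesOpened(records: Seq[Record]): Int =
    if (records.isEmpty) 0
    else 1 + records.zip(records.tail).count { case (a, b) => a.partition != b.partition }

  // Sample task input with two partitions interleaved.
  val interleaved: Seq[Record] = Seq(
    Record("2021-03-01", 1), Record("2021-03-02", 2),
    Record("2021-03-01", 3), Record("2021-03-02", 4))

  def main(args: Array[String]): Unit = {
    println(filesOpened(interleaved))                      // interleaved input
    println(filesOpened(interleaved.sortBy(_.partition)))  // sorted within the task
  }
}
```

Under these assumptions, sorting within the task bounds the number of files by the number of distinct partitions the task sees, without keeping more than one file open at a time.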




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790130907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135719/
   




[GitHub] [spark] aokolnychyi commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r586641888



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -224,6 +224,8 @@ class Dataset[T] private[sql](
     // For various commands (like DDL) and queries with side effects, we force query execution
     // to happen right away to let these side effects take place eagerly.
     val plan = queryExecution.analyzed match {
+      case _: V2MicroBatchWriteCommand =>

Review comment:
       This bit is needed to prevent executing each batch early. We used `WriteToDataSourceV2` before, which did not extend `Command`.
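
A rough way to picture why the extra case is needed: once the micro-batch write plan extends `Command`, a generic "run commands eagerly" rule would also pick it up unless it is matched first. The sketch below is a simplified stand-in, not the actual `Dataset.scala` logic; `Plan`, `DdlCommand`, `MicroBatchWrite`, and `executesEagerly` are hypothetical names.

```scala
// Simplified stand-in for the eager-execution decision (not the real Spark code).
object EagerCommandSketch {
  sealed trait Plan
  trait Command extends Plan
  final case class Query(name: String) extends Plan
  final case class DdlCommand(name: String) extends Command
  // Hypothetical micro-batch write plan that, unlike the old plan, is a Command.
  final case class MicroBatchWrite(batchId: Long) extends Command

  // Decide whether a plan should run as soon as it is wrapped.
  def executesEagerly(plan: Plan): Boolean = plan match {
    case _: MicroBatchWrite => false // must be listed before the generic Command case
    case _: Command         => true  // DDL and other side-effecting commands
    case _                  => false // ordinary queries stay lazy
  }
}
```

The order of the cases matters here: if the generic `Command` case came first, each batch would be executed early.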






[GitHub] [spark] HeartSaVioR edited a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791149996


   Actually, that's one of the few advantages of micro-batch over record-to-record processing, and we already leverage it in some public APIs (e.g. flatMapGroupsWithState, which "sorts" the inputs in a specific micro-batch so that values from the same group can be served sequentially). 
   
   That said, I support the concept of ordering, but only for micro-batch. Dealing with sort in continuous mode is quite tricky: due to the nature of record-to-record processing, sort requires buffering inputs in state or somewhere in memory until the epoch has finished (though we can keep the state or buffer sorted), and downstream operations can only continue their work, which contradicts the fact that the epoch is finished.
   
   My 2 cents on continuous mode: we'd be better off admitting the architectural differences between the batch-oriented and streaming-oriented modes, and finding a safe way to isolate the two. Integrating the two naturally sounds very hard to achieve, and has even been a roadblock to improving functionality in micro-batch mode.




[GitHub] [spark] SparkQA removed a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-789951567


   **[Test build #135719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135719/testReport)** for PR 31700 at commit [`e7f167f`](https://github.com/apache/spark/commit/e7f167fb0103d7e62127421cf5ad2a594d5f4337).




[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-872407269


   Okay, I'll update the PR by the end of the week and then we can decide whether it is something we want to have in 3.2.0. I am fine not including this change but the feature we release in 3.2 won't be complete in that case.




[GitHub] [spark] cloud-fan commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791170887


   OK, at least we should document this clearly in `RequiresDistributionAndOrdering`.




[GitHub] [spark] aokolnychyi commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r587811639



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##########
@@ -166,6 +166,52 @@ object OverwritePartitionsDynamic {
   }
 }
 
+case class AppendMicroBatch(

Review comment:
       Oh, I thought only update mode was under discussion. Are overwrite and append modes under discussion?
   
   I saw @HeartSaVioR's PR to rename `SupportsStreamingUpdateAsAppend`. Is there a discussion I can take a look at?






[GitHub] [spark] aokolnychyi commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r587811639



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##########
@@ -166,6 +166,52 @@ object OverwritePartitionsDynamic {
   }
 }
 
+case class AppendMicroBatch(

Review comment:
       Oh, I thought only update mode was under discussion. Are overwrite and append modes under discussion too?
   
   I saw @HeartSaVioR's PR to rename `SupportsStreamingUpdateAsAppend`. Is there a discussion I can take a look at?






[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-870063410


   @cloud-fan @HyukjinKwon @HeartSaVioR @dongjoon-hyun @viirya @sunchao, SPARK-34183 is marked as a blocker for 3.2.0. I can update this PR but I'll need input on open questions.
   
   I guess the primary discussion spot is [here](https://github.com/apache/spark/pull/31700#discussion_r588010745). I understand the streaming plans may not be ready. If so, I propose to just extend the existing micro-batch plans with the distribution and ordering capabilities.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790011236


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40301/
   




[GitHub] [spark] aokolnychyi commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r587811639



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##########
@@ -166,6 +166,52 @@ object OverwritePartitionsDynamic {
   }
 }
 
+case class AppendMicroBatch(

Review comment:
       Oh, I thought only update mode was under discussion. Do overwrite and append have the same issues?
   
   I saw @HeartSaVioR's PR to rename `SupportsStreamingUpdateAsAppend`. Is there a discussion I can take a look at?






[GitHub] [spark] HeartSaVioR commented on a change in pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31700:
URL: https://github.com/apache/spark/pull/31700#discussion_r587970348



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
##########
@@ -166,6 +166,52 @@ object OverwritePartitionsDynamic {
   }
 }
 
+case class AppendMicroBatch(

Review comment:
       Streaming write semantics are not the same as batch ones. The semantics are bound to the stateful operation; there should be only `append`, `update` (not the same as overwrite), and `truncate and append (complete)`. For update, we haven't yet constructed a proper way to define it.
   
   The major concern is that the group keys in the stateful operation must be used as keys in update mode. That is currently not possible (there are some sketched ideas on this, though), so Spark has been handling update with the huge risk that it does the same as append, and the risk is delegated to the sink (or user). The sink or user has to deal with reflecting the appended output as an "upsert". That's why I renamed `SupportsStreamingUpdate` to `SupportsStreamingUpdateAsAppend` to clarify the behavior.
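
To make the "update as append" risk concrete, here is a minimal sketch of what the sink (or user) ends up doing. It is a toy model, not a Spark API: `upsert` and the in-memory `Map` table are hypothetical, and the choice of key is an assumption made by the sink, since the engine does not pass the group keys along.

```scala
// Toy model of a sink that receives plain appended rows in update mode
// and has to turn them into an upsert on its own.
object UpdateAsAppendSketch {
  // Hypothetical sink-side merge: the engine only hands over appended rows;
  // the sink picks the key and keeps the latest value seen for each key.
  def upsert[K, V](table: Map[K, V], appended: Seq[(K, V)]): Map[K, V] =
    appended.foldLeft(table) { case (t, (k, v)) => t.updated(k, v) }
}
```

If the sink guesses a key that differs from the group keys of the stateful operation, the result silently diverges, which is the risk described above.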






[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-789932144


   After this change, we no longer use `WriteToDataSourceV2`, which was deprecated in 2.4.




[GitHub] [spark] HeartSaVioR commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-871950154


   Sorry, I forgot this one. Given that we don't have streaming plans and there's no plan to address this, I'm +1 on extending the existing micro-batch plan with distribution and ordering capabilities. Even better if we can make it fail in continuous mode with a proper explanation.
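
One possible shape for the "fail in continuous mode" check is sketched below. This is purely illustrative: `ExecutionMode`, `validate`, and the error message are hypothetical and do not correspond to existing Spark classes or messages.

```scala
// Hypothetical planning-time check (not Spark code).
object ContinuousModeCheckSketch {
  sealed trait ExecutionMode
  case object MicroBatch extends ExecutionMode
  case object Continuous extends ExecutionMode

  // Returns Left(reason) when the write cannot be planned, Right(()) otherwise.
  def validate(mode: ExecutionMode, hasRequirements: Boolean): Either[String, Unit] =
    (mode, hasRequirements) match {
      case (Continuous, true) =>
        Left("Sinks that require a distribution or ordering are not supported " +
          "in continuous mode; use a micro-batch trigger instead.")
      case _ => Right(())
    }
}
```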




[GitHub] [spark] SparkQA commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-789951567


   **[Test build #135719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135719/testReport)** for PR 31700 at commit [`e7f167f`](https://github.com/apache/spark/commit/e7f167fb0103d7e62127421cf5ad2a594d5f4337).





[GitHub] [spark] HeartSaVioR commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791149996


   Actually, that's one of the few advantages of micro-batch over record-to-record processing, and we already leverage it in some public APIs (e.g. flatMapGroupsWithState, which "sorts" the inputs so that values from the same group can be served sequentially). 
   
   That said, I support the concept of ordering, but only for micro-batch. Dealing with sort in continuous mode is quite tricky: due to the nature of record-to-record processing, sort requires buffering inputs in state or somewhere in memory until the epoch has finished (though we can keep the state or buffer sorted), and downstream operations can only continue their work, which contradicts the fact that the epoch is finished.
   
   My 2 cents on continuous mode: we'd be better off admitting the architectural differences between the batch-oriented and streaming-oriented modes, and finding a safe way to isolate the two. Integrating the two naturally sounds very hard to achieve, and has even been a roadblock to improving functionality in micro-batch mode.




[GitHub] [spark] aokolnychyi commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790939311


   > Micro-batch is an implementation detail and shouldn't be used to define the semantic.
   
   That's a fair argument. I'd like to hide it as well. Do you have a good idea, @cloud-fan? I was inspired by the existing code where we had `WriteToContinuousDataSource` and `WriteToMicroBatchDataSource`. We probably need a way to distinguish continuous and micro-batch writes (does not have to be a separate plan, though).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788694799


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40209/
   




[GitHub] [spark] SparkQA commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-788689928


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40209/
   




[GitHub] [spark] cloud-fan edited a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
cloud-fan edited a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791136860


   Maybe we can define required ordering as "sort the data within each epoch". For batch, an epoch is an entire partition. For micro-batch, an epoch is a micro-batch. For continuous, we already have the epoch semantics. @HeartSaVioR what do you think about it?
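
The proposed definition can be read as "order is guaranteed inside an epoch, not across epochs". A minimal sketch of that reading follows; `Row` and `sortWithinEpochs` are hypothetical names, and the epoch id is assumed to be attached to each row only for the sake of the example.

```scala
// Toy illustration of "sort the data within each epoch" (not Spark code).
object EpochSortSketch {
  final case class Row(epoch: Long, key: String)

  // Sort rows by key inside each epoch while keeping epochs in ascending order.
  // The output is ordered per epoch but not globally ordered by key.
  def sortWithinEpochs(rows: Seq[Row]): Seq[Row] =
    rows.groupBy(_.epoch).toSeq
      .sortBy { case (epoch, _) => epoch }
      .flatMap { case (_, epochRows) => epochRows.sortBy(_.key) }
}
```

For batch, all rows of a partition would share one epoch; for micro-batch, each batch would be its own epoch.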




[GitHub] [spark] cloud-fan commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791136860


   Maybe we can define required ordering as "sort the data per epoch". For batch, an epoch is an entire partition. For micro-batch, an epoch is a micro-batch. For continuous, we already have the epoch semantics. @HeartSaVioR what do you think about it?




[GitHub] [spark] SparkQA commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-790002130


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40301/
   




[GitHub] [spark] HeartSaVioR edited a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791149996


   Actually, that's one of the few advantages of micro-batch over record-to-record processing, and we already leverage it in some public APIs (e.g. flatMapGroupsWithState, which "sorts" the inputs within a given micro-batch so that values from the same group can be served to the user function sequentially, wrapped in an iterator. Imagine how that could be done without sorting.)
   
   That said, I'm supportive of the concept of ordering, but only for micro-batch. Dealing with sort in continuous mode is quite tricky: due to the nature of record-to-record processing, a sort has to buffer inputs in state or somewhere in memory until the epoch has finished (though the state or buffer could be kept sorted as records arrive), and downstream operations can only continue their work after that, which contradicts the fact that the epoch is finished.
   
   My 2 cents on continuous mode is that we'd better admit the architectural differences between batch-oriented and streaming-oriented execution, and try to find a safe way to isolate the two. Integrating the two naturally sounds very hard to achieve, and has even been a roadblock for improving functionality in micro-batch mode.
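
The buffering problem described for continuous mode can be sketched in a few lines of plain Python. This is a simplified illustration, not how Spark's continuous engine is implemented: the class `EpochSortBuffer` and its method names are hypothetical.

```python
class EpochSortBuffer:
    """Sort operator for record-at-a-time input (hypothetical sketch)."""

    def __init__(self, ordering_key):
        self._key = ordering_key
        self._buffer = []

    def on_record(self, row):
        # Nothing can be emitted yet: a row that sorts earlier may still
        # arrive before the epoch ends, so the row is only buffered.
        self._buffer.append(row)
        return []

    def on_epoch_end(self):
        # Only once the epoch marker arrives can sorted output be released,
        # so downstream work for this epoch starts after the epoch "ended".
        rows = sorted(self._buffer, key=self._key)
        self._buffer = []
        return rows


buf = EpochSortBuffer(ordering_key=lambda r: r)
emitted_during_epoch = buf.on_record(3) + buf.on_record(1) + buf.on_record(2)
emitted_at_epoch_end = buf.on_epoch_end()
```

Downstream operators see no rows while the epoch is open and receive all of them only at the epoch boundary, which is the tension with record-to-record processing that the comment points out.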




[GitHub] [spark] sunchao commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-871946711


   > I guess the primary discussion spot is here. I understand the streaming plans may not be ready. If so, I propose to just extend the existing micro-batch plans with the distribution and ordering capabilities.
   
   +1 on this given that there are still open questions, so it's better not to introduce new APIs at this point. The new approach looks pretty straightforward too.




[GitHub] [spark] github-actions[bot] commented on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-860291250


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] HeartSaVioR edited a comment on pull request #31700: [SPARK-34183][SS] DataSource V2: Support required distribution and ordering in SS

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31700:
URL: https://github.com/apache/spark/pull/31700#issuecomment-791149996


   Actually, that's one of the few advantages of micro-batch over record-to-record processing, and we already leverage it in some public APIs (e.g. flatMapGroupsWithState, which "sorts" the inputs within a given micro-batch so that values from the same group can be served sequentially).
   
   That said, I'm supportive of the concept of ordering, but only for micro-batch. Dealing with sort in continuous mode is quite tricky: because of the nature of record-to-record processing, a sort has to buffer inputs in state or somewhere in memory until the epoch has finished (though the state or buffer could be kept sorted as records arrive), and downstream operations can only continue their work after that, which contradicts the fact that the epoch is finished.
   
   My 2 cents on continuous mode is that we'd better admit the architectural differences between batch-oriented and streaming-oriented execution, and try to find a safe way to isolate the two. Integrating the two naturally sounds very hard to achieve, and has even been a roadblock for improving functionality in micro-batch mode.

