Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/14 12:11:47 UTC

[GitHub] [spark] HeartSaVioR edited a comment on pull request #29715: [WIP][SPARK-32847][SS] Add DataStreamWriterV2 API

HeartSaVioR edited a comment on pull request #29715:
URL: https://github.com/apache/spark/pull/29715#issuecomment-692011667


   Thanks for the input.
   
   My initial goal was to enable reading a catalog table in an SS query, so I didn't touch the other parts of DataStreamWriter. I borrowed the concept of representing the "save mode" as the terminal method of the chain, but I'm also fine keeping `start()` as DataStreamWriter does. If you have an idea in mind about the problem, please feel free to share. The sketch below illustrates the two shapes.
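   
   (For concreteness, a rough sketch of the two shapes being discussed - the existing `start()`-terminated chain versus a mode-terminated chain like the batch DataFrameWriterV2. The V2-style names are purely illustrative, not the final API, since the PR is still WIP.)
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Assumed setup: a local session and a simple rate-source stream.
   val spark = SparkSession.builder().appName("dsv2-sketch").master("local[*]").getOrCreate()
   val df = spark.readStream.format("rate").load()
   
   // Existing API: builder methods accumulate config and `start()` launches the query.
   val query = df.writeStream
     .format("parquet")
     .outputMode("append")
     .option("checkpointLocation", "/tmp/ckpt")
     .start("/tmp/out")
   
   // Hypothetical V2-style chain (method names illustrative only): the terminal
   // method carries the "save mode", mirroring the batch DataFrameWriterV2
   // (`spark.table(...).writeTo("t").append()`).
   // df.writeStreamTo("catalog.db.table")
   //   .option("checkpointLocation", "/tmp/ckpt")
   //   .append()
   ```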
   
   I think there are some points to consider while designing:
   
   1. The output mode for the sink doesn't exactly match the output mode for the result table.
   
   We already know about the "update as append" case for DSv2 (the output mode for the result table is update, but the sink does an append), but in reality most built-in sinks append for every mode (even complete mode), simply because that's what we did in Spark 2.x. DSv1 is even more problematic: the interface is designed to only append, yet there's no limitation on the output mode for a DSv1 sink.
   
   I think we won't support DSv1 in DataStreamWriterV2, but the mismatch still remains in DSv2. Do we want to keep the mismatch forever, or fix it at least in DSv2? (Kafka is one example - the Kafka sink shouldn't allow update and complete mode. I think we did the right fix, but it messed up compatibility.) The snippet below illustrates the mismatch.
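   
   (A minimal sketch of the mismatch, assuming the `spark` session above, the Kafka connector on the classpath, and a broker at localhost:9092:)
   
   ```scala
   import org.apache.spark.sql.functions._
   
   // The output mode describes the *result table*: "update" means only rows that
   // changed since the last trigger are emitted.
   val counts = spark.readStream.format("rate").load()
     .groupBy(window(col("timestamp"), "10 seconds"))
     .count()
   
   // The sink decides what it physically does with those rows. The built-in Kafka
   // sink appends every incoming row as a new record, whether the query declared
   // update or complete mode, so downstream readers effectively see "append".
   counts
     .selectExpr("CAST(window.end AS STRING) AS key", "CAST(count AS STRING) AS value")
     .writeStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "localhost:9092")
     .option("topic", "counts")
     .option("checkpointLocation", "/tmp/ckpt-kafka")
     .outputMode("update")
     .start()
   ```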
   
   2. Continuous mode hasn't been actively developed.
   
   Given the current status of SS development, I don't think continuous mode will leverage the output mode in the near future. (That is, output mode isn't needed there.) I'm not sure that will still hold later - if it does, we may be able to split the builders for micro-batch and continuous mode and drop the output mode from the continuous one (see the sketch at the end of this point).
   
   (TBH, I'm wondering whether continuous mode is actually being used in production - the mode was introduced in Spark 2.3, and no one has asked to graduate it from experimental. No contributor has been taking care of it. Is that something we might consider retiring to reduce complexity?)
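   
   (For reference, this is roughly how continuous mode is opted into today; since continuous processing only supports map-like operations, the output mode has no real effect there. A minimal sketch assuming the same `spark` session:)
   
   ```scala
   import org.apache.spark.sql.streaming.Trigger
   
   // Continuous processing is opted into via the trigger; the rest of the builder
   // chain is identical to micro-batch, including the (effectively unused) output mode.
   val continuousQuery = spark.readStream
     .format("rate")
     .load()
     .selectExpr("value", "timestamp")        // only map-like operations are supported
     .writeStream
     .format("console")
     .trigger(Trigger.Continuous("1 second")) // checkpoint interval of one second
     .option("checkpointLocation", "/tmp/ckpt-continuous")
     .start()
   ```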
   
   3. More things to consider?
   
   Without clear answers to these considerations, it would be hard to construct a good API.

