Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/14 20:35:39 UTC

[GitHub] HeartSaVioR commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

URL: https://github.com/apache/spark/pull/22282#issuecomment-463784685
 
 
   Adding new column(s) to a DataSource can crash existing Structured Streaming queries when the new columns propagate into the state schema. For example, as @zsxwing explained, `dropDuplicates()` (note: no explicit columns provided) uses all columns, so unless the query has a projection that selects columns explicitly before calling `dropDuplicates()`, the state schema changes and becomes incompatible with the existing state.
   
   Even worse, if my understanding is right (please correct me if I'm missing something here), the error message would not be informative: there is no check on state schema compatibility because we don't store the schema of the state explicitly (that is on my radar for a future contribution), so the query might not just crash but might show undefined behavior.
   
   If we don't make this optional and instead enable the change by default, sadly there is no workaround: users will be required to fix their query (by passing all columns except the header to `dropDuplicates()`, or by adding a select before it) to keep using their existing state.
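
To make the failure mode and the fix concrete, here is a toy model in plain Python (not Spark code). It assumes the new column is named `headers`, which is my assumption and not something settled in this thread; `dedup_key_columns` and `state_is_compatible` are hypothetical helpers for illustration only. In particular, Spark does not perform such a compatibility check today, as noted above.

```python
# Toy model (plain Python, not Spark): how the implicit "all columns" key of
# dropDuplicates() changes when the source gains a column.

# Columns the Kafka source exposes today.
KAFKA_COLUMNS = [
    "key", "value", "topic", "partition", "offset", "timestamp", "timestampType",
]
# Hypothetical schema after the change; the name "headers" is an assumption.
KAFKA_COLUMNS_WITH_HEADERS = KAFKA_COLUMNS + ["headers"]


def dedup_key_columns(source_columns, subset=None):
    """Columns that form the dedup state key.

    Mirrors dropDuplicates(): with no subset, every source column is used.
    """
    return list(source_columns) if subset is None else list(subset)


def state_is_compatible(old_key, new_key):
    """Hypothetical check: state written under old_key is only readable
    if the key columns are unchanged."""
    return old_key == new_key


# dropDuplicates() with no arguments: the key follows the source schema.
implicit_before = dedup_key_columns(KAFKA_COLUMNS)
implicit_after = dedup_key_columns(KAFKA_COLUMNS_WITH_HEADERS)

# dropDuplicates(<explicit columns>) or select(...) first: the key is pinned.
explicit_before = dedup_key_columns(KAFKA_COLUMNS, subset=KAFKA_COLUMNS)
explicit_after = dedup_key_columns(KAFKA_COLUMNS_WITH_HEADERS, subset=KAFKA_COLUMNS)

print(state_is_compatible(implicit_before, implicit_after))  # False
print(state_is_compatible(explicit_before, explicit_after))  # True
```

In a real query the pinned variant corresponds to listing the existing columns in `dropDuplicates(...)` or selecting them before calling it, which is exactly the query change users would be forced to make.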
   
   Unfortunately, in Structured Streaming existing state really does matter, and we need to take it into account every time. (Even when we make a change for batch queries, we should also look at it from the Structured Streaming side so that it doesn't break streaming queries as a side effect.)
