You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/28 02:22:26 UTC

[GitHub] [spark] HeartSaVioR commented on pull request #31638: [SPARK-34526][SS] Add a flag to skip checking file sink format and handle glob path

HeartSaVioR commented on pull request #31638:
URL: https://github.com/apache/spark/pull/31638#issuecomment-787221374


   The behavioral change is bound to file data source, right? I prefer adding the source option instead of adding config, because 1) Spark has a bunch of configurations already 2) I'd prefer having smaller range of impact, per source instead of session-wide.
   
   And also, the option should be used carefully (at least we'd be better to indicate to end users), as the metadata in file stream sink output ensures the "correctly written" files are only read. What would happen if they ignore it?
   
   1) files not listed in metadata could be corrupted, and these files are now being read. They may need to also turn on the flag "ignore corrupted" as well.
   
   2) They are no longer able to consider the output directory as "exactly once" - that is "at least one", meaning the output rows can be written multiple times. Without deduplication or consideration of logic on read side, it may result to incorrect output on read side.
   
   Technically this is an existing issue when batch query tries to read from multiple directories or glob path and it includes the output directory on file stream sink. (That said, they could use the glob path as a workaround without adding the new configuration, though I'd agree explicit config is more intuitive.) I'd love to see the proper notice on such case as well since we are here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org