Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/20 03:20:53 UTC

[GitHub] HeartSaVioR edited a comment on issue #23840: [SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata

HeartSaVioR edited a comment on issue #23840: [SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840#issuecomment-465405043
 
 
   In practice, end users would have a data retention policy, and output files could be removed based on that policy. Ideally the metadata would reflect changes to the output files, but from Spark's point of view that doesn't look easy to do. For example, if we periodically check the existence of the files in the metadata list (maybe every X batches to avoid concurrent modification), that check could add significant overhead and slow things down. Specifying the retention policy in the Spark query (when the files are actually removed outside of Spark) is also really odd, so neither option is pretty.
   
   If it's OK for the file stream sink to periodically check the existence of files and drop removed files from the file log (fewer side effects, but I'm not sure about the performance), I'll apply the change.
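   
   For illustration, here is a minimal sketch of what that periodic pruning could look like. This is not the actual implementation: `MetadataPruneSketch`, `maybePrune`, and the plain path-string entries are hypothetical, and the real file stream sink log stores richer status entries rather than strings.
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   
   // Hypothetical helper: drop metadata entries whose output files no longer exist.
   // Plain path strings are used only to keep the sketch small.
   object MetadataPruneSketch {
     def maybePrune(
         batchId: Long,
         checkEveryNBatches: Int,
         entries: Seq[String],
         hadoopConf: Configuration): Seq[String] = {
       // Only pay the cost of the existence check every N batches.
       if (checkEveryNBatches <= 0 || batchId % checkEveryNBatches != 0) {
         entries
       } else {
         entries.filter { p =>
           val path = new Path(p)
           // One exists() call per entry, so the check is still O(number of entries).
           path.getFileSystem(hadoopConf).exists(path)
         }
       }
     }
   }
   ```
   
   The per-entry `exists()` calls are exactly the performance concern above: the cost grows with the size of the metadata log, which is why the check would only run every N batches.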

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org