You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/30 23:29:58 UTC

[GitHub] [spark] HeartSaVioR edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666770668

> for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.

I already provided the change (Spark 3.0.0 if I remember correctly), but this is only "best-effort". You cannot deal with crashing scenario.

Given we directly write the file to the final path, worth noting that reading output files in directory directly doesn't only mean you're reading duplicated outputs (at least once). This also means there's a chance you may be reading incomplete/corrupted files as well, in some sort of crashing during write.

Stepping back to explain the rationalization and the goal - providing holistic solution is not a goal.

There're already lots of efforts being made to provide holistic solution, though these efforts are happening "outside" of Spark codebase, Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just reinvent the wheel if I try to address the entire problems, which I can't persuade anyone to provide me enough time to work.
(Please refer the comment which is the feedback I got - https://github.com/apache/spark/pull/27694#issuecomment-651454246)

My goal of the overall improvements in file stream souce/sink is, enabling end users run the query a bit longer without weird issues (like OOM). I just had to take easier (and limited) approach to solve the issue. e.g. regarding growing entries issue, while alternatives support data compaction to reduce the overall number of files, I proposed retention of the output files so that older files can be excluded in metadata log.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org