
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/30 09:11:28 UTC

[GitHub] [spark] itsvikramagr commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

itsvikramagr commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666248758

@HeartSaVioR - This is a much-needed fix. Thanks for it.

I have an orthogonal question. Why do we need to worry about compacting the file sink metadata? I can think of the following reasons:
- Downstream read operations can read the compacted metadata file to list all committed files, so they avoid the listing cost and perform better.
- It helps with exactly-once semantics. On task failures, we don't have to worry about deleting any files already written.

If the compacted metadata file size is running into GBs, the number of valid files would be in the millions. In practice, the end user will treat this sink path as a staging location and run another job to compact these small files into a final destination.

For exactly-once semantics, we can make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
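
To make the abort idea concrete, here is a minimal sketch in plain Python. It is not Spark code: `TrackingCommitter` and its methods are hypothetical names used only for illustration, and a real change would live in the committer's abort path. The assumption is simply that the committer remembers every file a task wrote and removes those files if the task is aborted.

```python
import os
import tempfile


class TrackingCommitter:
    """Hypothetical committer that tracks files written by one task."""

    def __init__(self):
        # Paths written by the task but not yet committed.
        self.pending = []

    def new_file(self, directory, name):
        # Write a small placeholder file and remember its path.
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write("data")
        self.pending.append(path)
        return path

    def commit(self):
        # Hand the written files over and forget them locally.
        committed, self.pending = self.pending, []
        return committed

    def abort(self):
        # Best-effort cleanup: delete everything the failed task wrote.
        for path in self.pending:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # already gone, nothing to clean up
        self.pending = []


# Usage: a task writes one file, fails, and abort() removes it.
out_dir = tempfile.mkdtemp()
committer = TrackingCommitter()
written = committer.new_file(out_dir, "part-00000")
committer.abort()
leftover = os.path.exists(written)  # False after abort
```

With cleanup done on abort, readers that list the directory would not see files from failed tasks, so they would not need the metadata log to filter them out. Files left behind by a crashed process that never reaches abort would still need separate handling.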

In short, what if we provide an option to keep only the last few commits in the sink metadata, so that SS is not impacted, and change the various readers not to read using the metadata log files? Wouldn't that help ensure the reliability of the streaming job?
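
A rough sketch of the "keep only the last few commits" option, again in plain Python rather than Spark code. `BoundedSinkLog` and `max_batches` are hypothetical names, not an existing Spark class or config. The assumption is that the sink only needs the latest committed batch id to detect replays, so older entries can be dropped instead of being carried forward into an ever-growing compacted file.

```python
from collections import OrderedDict


class BoundedSinkLog:
    """Hypothetical sink log that retains only the newest batches."""

    def __init__(self, max_batches):
        if max_batches < 1:
            raise ValueError("max_batches must be at least 1")
        self.max_batches = max_batches
        # batch id -> files committed in that batch, in insertion order
        self.batches = OrderedDict()

    def add(self, batch_id, files):
        # Record the batch, then evict the oldest entries beyond the limit.
        self.batches[batch_id] = list(files)
        while len(self.batches) > self.max_batches:
            self.batches.popitem(last=False)

    def latest_batch_id(self):
        # The newest committed batch, or None if nothing was committed.
        return next(reversed(self.batches), None)

    def already_committed(self, batch_id):
        # A replayed batch is one at or below the latest committed id.
        latest = self.latest_batch_id()
        return latest is not None and batch_id <= latest


# Usage: with a limit of 2, only the two newest batches are retained.
log = BoundedSinkLog(max_batches=2)
for batch_id in range(4):
    log.add(batch_id, [f"part-{batch_id}"])
retained = list(log.batches)  # [2, 3]
```

The trade-off is the one raised above: once old entries are dropped, readers can no longer rely on the log for the full list of committed files and would have to list the directory instead.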
