Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/25 07:46:05 UTC

[GitHub] [iceberg] davseitsev commented on pull request #2660: Spark3 structured streaming micro_batch read support

davseitsev commented on pull request #2660:
URL: https://github.com/apache/iceberg/pull/2660#issuecomment-1020896240


   @SreeramGarlapati  thank you very much for your work. I have a few questions.
   
   We have a Spark streaming job that reads data from Kafka, processes it, and stores it to an Iceberg table partitioned by day.
   There is a background compaction process and a scheduled cleanup task that expires old snapshots to remove old small files.
   I want to build another streaming job that reads the tables produced by the first job, unions them, filters the necessary rows, and stores the data to an Iceberg table.
   Thus I'd like to better understand how expired snapshots are handled.
   
   In our case the source table contains `append` and `replace` snapshots.
   `MicroBatchStream.initialOffset()` always returns a `StreamingOffset` with `scanAllFiles=true` to process historical data. Since old snapshots are expired by the cleanup process, we can get into a situation where the first remaining snapshot is of type `replace`. Due to #2752 we ignore `replace` snapshots. Will this lead to a situation where we skip the initial snapshot with `scanAllFiles=true` and lose all data appended in old (expired) snapshots?
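   A minimal sketch of the concern above, assuming (hypothetically) that the stream simply filters out `replace` snapshots: if the oldest live snapshot after expiration is a compaction (`replace`) snapshot that carries all historical data, the `scanAllFiles=true` pass never reads it. The `Snapshot` record and the skip rule here are illustrative stand-ins, not Iceberg's actual classes:

   ```java
   import java.util.List;

   public class ReplaceSkipSketch {
       // Illustrative stand-in for a table snapshot: id plus operation type.
       record Snapshot(long id, String operation) {}

       // Snapshots the stream would actually read if replace snapshots
       // are skipped (the behavior this question assumes from #2752).
       static List<Snapshot> readable(List<Snapshot> live) {
           return live.stream()
                   .filter(s -> !s.operation().equals("replace"))
                   .toList();
       }

       public static void main(String[] args) {
           // After expiration, the oldest remaining snapshot is a compaction
           // (replace) snapshot holding the rewritten historical files.
           List<Snapshot> live = List.of(
                   new Snapshot(200L, "replace"), // would be the scanAllFiles=true pass
                   new Snapshot(201L, "append"));
           // The replace snapshot is filtered out, so the historical files
           // it rewrote are never scanned by the initial full-table pass.
           System.out.println(readable(live));
       }
   }
   ```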
   
   And one more question.
   Let's say we have data in the source table for 1 year, we expire snapshots older than 7 days, and the cleanup job runs every hour.
   If a job starts reading this table from the initial offset, it has at most 1 hour to process the first snapshot, doesn't it? Since the initial snapshot is processed with `scanAllFiles=true`, it is the biggest one: it contains data for 1 year minus 7 days. If I'm correct, there is a big chance that the streaming job will fail mid-way when the cleanup job runs, because the cleanup will expire the very snapshot the streaming job is processing.
   Would it work if `initialOffset()` returned the latest snapshot with the `scanAllFiles` flag set, instead of the first snapshot?
   - `latestOffset()` -> [LatestSnapshot, LatestFile, false]
   - `initialOffset()` -> [LatestSnapshot, 0, true]
   In that case we should probably have 7 days to process the initial snapshot.
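   The proposal above can be sketched with a toy model of the `(snapshotId, position, scanAllFiles)` triple. This is a hypothetical stand-in for Iceberg's `StreamingOffset` (the real class lives in `org.apache.iceberg.spark.source` and carries more state); the snapshot IDs are made up. The point of the sketch is only the contrast: the current behavior anchors the full scan on the oldest live snapshot (about to expire), while the proposal anchors it on the newest one (longest remaining lifetime):

   ```java
   import java.util.List;

   public class OffsetSketch {
       // Simplified stand-in for Iceberg's StreamingOffset.
       record StreamingOffset(long snapshotId, long position, boolean scanAllFiles) {}

       // Current behavior (as this comment understands it): start from the
       // OLDEST live snapshot and scan all of its files. This snapshot is
       // the next one the cleanup job will expire.
       static StreamingOffset currentInitialOffset(List<Long> liveSnapshotIds) {
           return new StreamingOffset(liveSnapshotIds.get(0), 0, true);
       }

       // Proposed behavior: start from the LATEST snapshot, still with
       // scanAllFiles=true, so the snapshot being scanned has the longest
       // remaining time before expiration (~7 days in the example above).
       static StreamingOffset proposedInitialOffset(List<Long> liveSnapshotIds) {
           return new StreamingOffset(
                   liveSnapshotIds.get(liveSnapshotIds.size() - 1), 0, true);
       }

       public static void main(String[] args) {
           // Live snapshot IDs, oldest to newest, after expiration.
           List<Long> live = List.of(101L, 102L, 103L);
           System.out.println("current : " + currentInitialOffset(live));
           System.out.println("proposed: " + proposedInitialOffset(live));
       }
   }
   ```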
   
   Are there any other corner cases with snapshots expiration?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
