Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/13 22:10:38 UTC

[GitHub] [iceberg] mtnrabi opened a new issue, #6422: How compaction works along side incremental read

mtnrabi opened a new issue, #6422:
URL: https://github.com/apache/iceberg/issues/6422

   ### Query engine
   
   Spark
   
   ### Question
   
   In the docs, it’s mentioned that incremental read “Currently gets only the data from append operation. Cannot support replace, overwrite, delete operations.”
   
   When compacting (with the rewrite files procedure), will incremental read use the compacted files and their snapshots,
   or will it use the old snapshots and files?
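   
   For context, here is a minimal sketch of how an incremental read is usually requested from PySpark. It assumes the `start-snapshot-id` and `end-snapshot-id` read options described in the Iceberg Spark docs; the helper name `incremental_read` and its parameters are hypothetical, chosen only for illustration.

```python
def incremental_read(spark, table, start_snapshot_id, end_snapshot_id):
    """Sketch: read the rows appended between two snapshots of an Iceberg table.

    `spark` is expected to be a SparkSession with an Iceberg catalog configured,
    and `table` a table identifier or path. Both are assumptions of this sketch.
    """
    return (
        spark.read.format("iceberg")
        # Start of the interval; per the Iceberg docs this bound is exclusive.
        .option("start-snapshot-id", str(start_snapshot_id))
        # End of the interval; per the Iceberg docs this bound is inclusive.
        .option("end-snapshot-id", str(end_snapshot_id))
        .load(table)
    )
```

   The snapshot IDs would normally come from the table's snapshot history; only data added by `append` snapshots inside that interval is returned.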


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] mrabisf commented on issue #6422: How compaction works along side incremental read

Posted by "mrabisf (via GitHub)" <gi...@apache.org>.

mrabisf commented on issue #6422:
URL: https://github.com/apache/iceberg/issues/6422#issuecomment-1518979122

   Many thanks for the detailed explanation, @ChristinaTech.
   
   So, since incremental read uses the old files rather than the compacted ones, would you suggest using incremental read only on fairly small batches?
   
   Otherwise, would you suggest querying on a partitioned time column with something like a `BETWEEN` or `>` predicate? Would that perform better, since those queries use the compacted files?
   
   I'm just trying to understand the best practice for doing a hopping window on an Iceberg table.



[GitHub] [iceberg] ChristinaTech commented on issue #6422: How compaction works along side incremental read

Posted by "ChristinaTech (via GitHub)" <gi...@apache.org>.

ChristinaTech commented on issue #6422:
URL: https://github.com/apache/iceberg/issues/6422#issuecomment-1524591137

   Just because incremental read can't use compacted data doesn't mean you can't use incremental reads, only that they may be less efficient. How inefficient they are depends largely on how efficiently the data was stored when it was first appended to the Iceberg table.
   
   While querying on a partitioned time column would have the advantage of using compacted files where available, it would come with the disadvantage of missing or duplicate results whenever data is ingested late. You could work around this by waiting until you are certain no more data will arrive for a given time range before querying it, but that would increase the delay between data arriving and actually being processed.
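   
   A toy model of that trade-off, as a sketch only: it is plain Python rather than Iceberg or Spark code, and the names `Row`, `read_by_time` and `read_incremental` are hypothetical. Each row carries the value of its time column and the snapshot that appended it, and one row for hour 10 arrives late in snapshot 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    event_hour: int   # value of the partitioned time column
    snapshot_id: int  # snapshot whose append committed this row


# Hours 10 and 11 arrive on time in snapshots 1 and 2.
# A late row for hour 10 only arrives in snapshot 3.
TABLE = [Row(10, 1), Row(11, 2), Row(10, 3)]


def read_by_time(rows, start_hour, end_hour, as_of_snapshot):
    """Time-range query against the table as it looked at `as_of_snapshot`."""
    return [
        r for r in rows
        if start_hour <= r.event_hour < end_hour and r.snapshot_id <= as_of_snapshot
    ]


def read_incremental(rows, start_snapshot, end_snapshot):
    """Rows appended after `start_snapshot` (exclusive) up to `end_snapshot` (inclusive)."""
    return [r for r in rows if start_snapshot < r.snapshot_id <= end_snapshot]


# Querying hour 10 right after snapshot 1 misses the late row...
early = read_by_time(TABLE, 10, 11, as_of_snapshot=1)
# ...and re-querying later returns the already-processed row again.
later = read_by_time(TABLE, 10, 11, as_of_snapshot=3)

# Consecutive snapshot intervals hand over each row exactly once.
batches = [read_incremental(TABLE, s, s + 1) for s in range(3)]
```

   In this model the time-range reader either misses `Row(10, 3)` or sees `Row(10, 1)` twice, while the snapshot-interval reader delivers every row once regardless of its event time.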
   
   Since my last comment, I have spent some time thinking about how incremental reads could be improved to use compacted data, and I am going to open a separate issue in the coming days to track some potential improvements to this process.



[GitHub] [iceberg] ChristinaTech commented on issue #6422: How compaction works along side incremental read

Posted by "ChristinaTech (via GitHub)" <gi...@apache.org>.

ChristinaTech commented on issue #6422:
URL: https://github.com/apache/iceberg/issues/6422#issuecomment-1518090607

   At present, incremental read will use the old snapshots and files. The primary limiting factor is that `replace` snapshots, which add and remove data files without changing the actual data and are what rewrite procedures produce, do not keep close track of which files were used to create which other files, or how.
   
   This means that, even if support were added for interpreting `replace` snapshots as they exist today, their replacement files could only be used if every file removed by the replace was included in the interval of the incremental read.
   
   This could be moderately improved if `replace` snapshots stored a map of which specific files were used to create which other files, but even then it wouldn't help in a lot of cases, because a rewrite will by default generally end up merging files from inside the incremental read interval with files from outside it.
   
   I will note that it would be beneficial if Iceberg could support this behavior, as it would help mitigate the performance impact of micro-batch file ingestion on incremental reads that take place after compaction. I need to spend some time brainstorming technical solutions to the problems preventing this from happening.
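   
   The condition described above can be sketched with a small model. This is plain Python, not Iceberg's implementation: `Snapshot`, `incremental_files` and `replace_is_usable` are hypothetical names, and file names stand in for data files.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    snapshot_id: int
    operation: str                  # "append" or "replace"
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


def incremental_files(history, start_id, end_id):
    """Files scanned today: those added by append snapshots in (start_id, end_id]."""
    files = set()
    for snap in history:
        if start_id < snap.snapshot_id <= end_id and snap.operation == "append":
            files |= snap.added
    return files


def replace_is_usable(replace, interval_files):
    """The rewritten files can stand in only if every removed file lies in the interval."""
    return replace.removed <= interval_files


# Three micro-batch appends, then a compaction that merges all three files.
history = [
    Snapshot(1, "append", added=frozenset({"a"})),
    Snapshot(2, "append", added=frozenset({"b"})),
    Snapshot(3, "append", added=frozenset({"c"})),
    Snapshot(4, "replace", added=frozenset({"abc"}), removed=frozenset({"a", "b", "c"})),
]
compaction = history[3]

# Reading after snapshot 1: "a" is outside the interval, so "abc" would bring in extra rows.
partial = incremental_files(history, 1, 4)
# Reading from the very beginning: every removed file is in the interval.
full = incremental_files(history, 0, 4)
```

   Here `replace_is_usable(compaction, partial)` is false because the compacted file also contains the rows of `a`, which were appended before the interval started, while `replace_is_usable(compaction, full)` is true.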

