Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/11/19 20:26:03 UTC

[GitHub] [incubator-iceberg] rdblue commented on issue #179: Use Iceberg tables as sources for Spark Structured Streaming

URL: https://github.com/apache/incubator-iceberg/issues/179#issuecomment-555696946
 
 
   > we should be able to stream out all currently present data in addition to what will arrive later.
   
   Agreed, but I think it would be easier to start from a particular snapshot for now, and add the ability to process all existing data as a follow-up.
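   A minimal sketch of what "start from a particular snapshot" could mean for the source's planner, assuming a simplified snapshot log of `(snapshot_id, timestamp)` entries; the `Snapshot` type and `plan_from` helper are illustrative, not part of the Iceberg API:
   
   ```python
   # Hypothetical sketch: given a table's snapshot log and a chosen starting
   # snapshot, plan the snapshots committed after it, in commit order. Each
   # planned snapshot would become one micro-batch for the stream.
   from dataclasses import dataclass
   from typing import List
   
   @dataclass
   class Snapshot:
       snapshot_id: int
       timestamp_ms: int
   
   def plan_from(log: List[Snapshot], start_snapshot_id: int) -> List[int]:
       """Return ids of snapshots committed after the starting snapshot."""
       ordered = sorted(log, key=lambda s: s.timestamp_ms)
       start_index = next(i for i, s in enumerate(ordered)
                          if s.snapshot_id == start_snapshot_id)
       return [s.snapshot_id for s in ordered[start_index + 1:]]
   
   log = [Snapshot(1, 100), Snapshot(2, 200), Snapshot(3, 300)]
   print(plan_from(log, 1))  # [2, 3]
   ```
   
   Processing all pre-existing data would then be the follow-up: a backfill phase that runs before this incremental planning begins.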
   
   > we need to take into account that files in the current snapshot could be added by already expired snapshots and we might not have metadata for those snapshots.
   
   Yes. To start with, I'd recommend ordering partition tuples and processing one partition at a time. Partitions are often time-based and correlated with the write pattern, so this approach will probably provide a smooth transition from partitions to snapshots.
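   The backfill order described above could look like the following sketch: group the current snapshot's data files by partition tuple, then emit one partition per batch in sorted order. The `(partition_tuple, file_path)` pair representation and `partition_batches` name are assumptions for illustration:
   
   ```python
   # Illustrative sketch: replay the current snapshot one partition at a time,
   # ordered by partition tuple, so time-based partitions come out roughly in
   # write order even when their originating snapshots have expired.
   from itertools import groupby
   from typing import Iterable, List, Tuple
   
   def partition_batches(
       files: Iterable[Tuple[tuple, str]]
   ) -> List[Tuple[tuple, List[str]]]:
       """files: (partition_tuple, file_path) pairs from the current snapshot.
       Returns one batch of file paths per partition, in partition order."""
       ordered = sorted(files, key=lambda f: f[0])
       return [(part, [path for _, path in group])
               for part, group in groupby(ordered, key=lambda f: f[0])]
   
   files = [(("2019-11-02",), "b.parquet"),
            (("2019-11-01",), "a.parquet"),
            (("2019-11-02",), "c.parquet")]
   for part, paths in partition_batches(files):
       print(part, paths)
   ```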
   
   > We can have a config per operation or a list of allowed operations.
   
   This proposal sounds good to me. For overwrite, it seems like we will need a delta format to express what happened to a row as well as to pass the row contents.
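   As a rough sketch of the kind of row-level delta record meant here, each streamed row would carry both its contents and what happened to it. The operation names and field layout below are assumptions, not an Iceberg format:
   
   ```python
   # Hypothetical delta record: pairs a row's contents with the change that
   # produced it, so an overwrite can be expressed as delete + insert deltas.
   from dataclasses import dataclass
   from enum import Enum
   from typing import Any, Dict, List
   
   class RowOp(Enum):
       INSERT = "insert"   # row added by an append or overwrite
       DELETE = "delete"   # row removed by an overwrite
   
   @dataclass
   class RowDelta:
       op: RowOp
       row: Dict[str, Any]
   
   # An overwrite that replaces one row would stream out two deltas:
   deltas: List[RowDelta] = [
       RowDelta(RowOp.DELETE, {"id": 1, "value": "old"}),
       RowDelta(RowOp.INSERT, {"id": 1, "value": "new"}),
   ]
   print([d.op.value for d in deltas])  # ['delete', 'insert']
   ```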

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
