
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/06/23 20:23:09 UTC

[GitHub] [iceberg] jackye1995 commented on issue #2723: Support for low latency writes to iceberg table

jackye1995 commented on issue #2723:
URL: https://github.com/apache/iceberg/issues/2723#issuecomment-867132431


   Adding responses from mailing list:
   
   @flyrain 
   For 3, it may not be worth adding extra complexity by introducing a "change set", unless we get solid data that shows writing a "change set" is faster than a complete rewrite. 
   
   @jackye1995 
   I think the summary looks mostly correct: we can already do "low latency" appends and deletes by primary key, or deletes with any predicate that resolves to finitely many equality constraints, without needing to scan all rows.
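
To make "resolves to finitely many equality constraints" concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's API: the function names and the example predicate are hypothetical. It assumes a predicate of the form `id IN (1, 2) AND region = 'us'` and expands it into one delete row per value combination, which is the shape an equality delete needs and which requires no scan of the data.

```python
from itertools import product


def equality_delete_rows(constraints):
    """Expand {column: [allowed values]} into one delete row per combination.

    Hypothetical helper: models the rows an equality delete would carry.
    """
    columns = sorted(constraints)
    return [
        dict(zip(columns, values))
        for values in product(*(constraints[c] for c in columns))
    ]


def is_deleted(row, delete_rows):
    """A data row is deleted if it equals some delete row on every listed column."""
    return any(
        all(row.get(col) == val for col, val in d.items()) for d in delete_rows
    )


# Hypothetical predicate: id IN (1, 2) AND region = 'us'
deletes = equality_delete_rows({"id": [1, 2], "region": ["us"]})
```

A predicate such as `price > 10` has no finite expansion of this kind, which is why it needs a scan (or an index) to find the affected rows.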
   
   A secondary index would be useful for scans, which in turn are used to generate delete files for complex delete predicates. This is still in design. We should probably push progress in this area a bit more, because I haven't heard any update in a few months:
   - https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ
   - https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY
   
   For metadata, I agree with Yufei that a change set approach sounds like a lot of extra complexity just to make the commit phase a little faster. I understand that the time of a single commit is ultimately bounded by the time it takes to rewrite that file, but this almost sounds like doing merge-on-read for metadata as well, which would be cool but is likely overkill for most users. In my mind, the metadata layer should be maintained continuously by the admin, and it is much simpler to periodically optimize the number of manifests to control the size of the manifest list file written on each commit, which also benefits scan planning performance. I am curious about your "low latency" requirement: are there actual numbers you need to hit, and could you share them here?
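
The trade-off above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a measurement of Iceberg: it assumes every commit adds exactly one manifest, that the manifest list holds one entry per manifest and is rewritten in full on each commit, and that a maintenance pass merges all manifests into one. The function name and the numbers are made up for illustration.

```python
def simulate(commits, merge_every=None):
    """Return the number of manifest-list entries written at each commit.

    Toy assumptions: one new manifest per commit, one list entry per
    manifest, and an optional maintenance pass every `merge_every`
    commits that merges all manifests into a single one.
    """
    manifests = 0
    entries_written = []
    for i in range(1, commits + 1):
        manifests += 1                     # the commit adds one manifest
        entries_written.append(manifests)  # the whole list is rewritten
        if merge_every and i % merge_every == 0:
            manifests = 1                  # maintenance merges manifests
    return entries_written


unmaintained = simulate(1000)
maintained = simulate(1000, merge_every=100)
```

In this model the unmaintained list grows to 1000 entries per commit, while the maintained one never exceeds 101, so per-commit rewrite cost stays bounded without a change-set mechanism.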


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


