Posted to issues@iceberg.apache.org by "rdblue (via GitHub)" <gi...@apache.org> on 2023/04/23 18:13:36 UTC

[GitHub] [iceberg] rdblue commented on pull request #7194: Core, AWS: Auto optimize table using post commit notifications

rdblue commented on PR #7194:
URL: https://github.com/apache/iceberg/pull/7194#issuecomment-1519124565

I'm going to close this PR because I don't think it is an approach that makes sense for the Iceberg project.

One of the reasons Iceberg exists is that it is important to solve problems in the right place. Before, we needed to solve problems in the processing engine or in a file format, and that produced awkward, half-baked solutions. Similarly, I think that this is not the right place or a good approach for optimization.

First, the ideal approach is to write data correctly in the first place. That's why Iceberg defines table-level tuning settings and write order, and why we request distribution and ordering in engines like Spark. We want to be able to asynchronously optimize tables, but we don't want to require it if we don't need to. Focusing effort on fixing the underlying problem (creating too many files) is a better approach. I think we should see if we can address the problem in the write path by coalescing outputs and aligning write distribution with table partitioning.
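As a rough illustration of why aligning write distribution with table partitioning reduces file counts, here is a minimal, library-free sketch. It is not Iceberg code: `Row`, `filesWithoutDistribution`, and `distributeByPartition` are hypothetical names, and the model assumes each writer produces exactly one file per partition value it sees. In Iceberg itself, the closest knob is the `write.distribution-mode` table property.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class WriteDistributionSketch {
    // Hypothetical row: a partition value (for example an event date) plus a payload.
    static final class Row {
        final String partition;
        final String payload;

        Row(String partition, String payload) {
            this.partition = partition;
            this.payload = payload;
        }
    }

    // No shuffle before the write: every task opens one file
    // for each distinct partition value in its own input.
    static int filesWithoutDistribution(List<List<Row>> tasks) {
        int files = 0;
        for (List<Row> task : tasks) {
            Set<String> seen = new HashSet<>();
            for (Row row : task) {
                seen.add(row.partition);
            }
            files += seen.size();
        }
        return files;
    }

    // Distribution aligned with partitioning: all rows for a partition
    // are routed to a single writer, so each partition yields one file.
    static Map<String, List<Row>> distributeByPartition(List<List<Row>> tasks) {
        Map<String, List<Row>> byPartition = new TreeMap<>();
        for (List<Row> task : tasks) {
            for (Row row : task) {
                byPartition.computeIfAbsent(row.partition, k -> new ArrayList<>()).add(row);
            }
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<List<Row>> tasks = List.of(
            List.of(new Row("2023-04-21", "a"), new Row("2023-04-22", "b")),
            List.of(new Row("2023-04-21", "c"), new Row("2023-04-22", "d")),
            List.of(new Row("2023-04-21", "e"), new Row("2023-04-23", "f")));

        System.out.println(filesWithoutDistribution(tasks));     // prints 6
        System.out.println(distributeByPartition(tasks).size()); // prints 3
    }
}
```

With three tasks and three partition values, the undistributed write produces six files while the aligned write produces three. The gap grows with the number of tasks, which is the "too many files" problem described above.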

Second, kicking off a job in a specific downstream system through an API intended to collect metrics is not a good design for asynchronous optimization. Quite a few review comments question aspects of this, and those are valid concerns. But setting the specifics aside, I think the choices here were made because this attempts to solve a problem in the wrong place. Rather than going in that direction and getting pulled deeper into a mess (adding more compute options or rules for how to take action), I think the right approach is to have APIs that enable people to build optimizers, similar to how we handle catalogs. That's why we built metrics reporting as an API: to get important information to downstream systems.
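To make the separation concrete, here is a minimal sketch of that shape: the library side only emits information after a commit, and an optimizer built outside the library decides what to do with it. Every name here (`CommitSummary`, `CommitListener`, `SmallFileOptimizer`) is a hypothetical stand-in and not Iceberg's actual metrics API. The "average file smaller than a quarter of the target" rule is an arbitrary assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class OptimizerApiSketch {
    // Hypothetical stand-in for a post-commit report; not an Iceberg class.
    static final class CommitSummary {
        final String table;
        final int addedDataFiles;
        final long addedBytes;

        CommitSummary(String table, int addedDataFiles, long addedBytes) {
            this.table = table;
            this.addedDataFiles = addedDataFiles;
            this.addedBytes = addedBytes;
        }
    }

    // Hypothetical reporting hook: the library calls this and does nothing else.
    interface CommitListener {
        void report(CommitSummary summary);
    }

    // A downstream optimizer owns the policy and the choice of compute.
    static final class SmallFileOptimizer implements CommitListener {
        private final long targetFileBytes;
        private final List<String> compactionQueue = new ArrayList<>();

        SmallFileOptimizer(long targetFileBytes) {
            this.targetFileBytes = targetFileBytes;
        }

        @Override
        public void report(CommitSummary summary) {
            if (summary.addedDataFiles == 0) {
                return; // nothing was written, nothing to evaluate
            }
            long averageBytes = summary.addedBytes / summary.addedDataFiles;
            // Assumed policy: queue the table when files average under 1/4 of the target.
            if (averageBytes < targetFileBytes / 4 && !compactionQueue.contains(summary.table)) {
                compactionQueue.add(summary.table);
            }
        }

        List<String> pending() {
            return List.copyOf(compactionQueue);
        }
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        SmallFileOptimizer optimizer = new SmallFileOptimizer(128 * mb);

        optimizer.report(new CommitSummary("db.events", 200, 400 * mb)); // ~2 MB per file
        optimizer.report(new CommitSummary("db.orders", 2, 256 * mb));   // 128 MB per file

        System.out.println(optimizer.pending()); // prints [db.events]
    }
}
```

The reporting side never learns about compute options or scheduling rules. Those stay in the optimizer, which can be replaced without touching the library.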

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
