Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/13 02:38:24 UTC

[GitHub] [iceberg] electrum commented on issue #2068: Procedure for adding files to a Table

electrum commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759165160

This feature sounds troublesome, as it seems to mix two requirements:

* Import existing files without **rewriting** them, because writing and storing multiple copies is expensive.
* Import existing files without **reading** them, because reading is expensive.

Users constantly run into metadata/schema issues in the Hive world, caused by a combination of the Hive metadata model and the use of different tools to write and manage files and metadata. One of the things I love about Iceberg is that it has a strong specification for metadata and tools to enforce it. Not having all sorts of random software writing raw files to disk in different ways is a feature.

I strongly oppose anything that makes it easy for users to create broken tables. Anything that is *"up to the user"* to get right is broken by design in my opinion. Users will always get it wrong. I wouldn't even trust myself to get it right. That's why our software has extensive tests and verification checks at runtime.

Reading data is not particularly expensive. You're presumably converting the data to Iceberg to query it, so query it once up front to make sure it's right. If you hide it behind an advanced, *"go fast flag for people who know what they're doing"*, guess what, everyone wants to go fast and thinks they know what they're doing.

Getting metadata wrong doesn't just cause future queries to fail. If the stats or partitioning are wrong, the queries can silently return wrong answers. In my opinion, that's the worst thing a data system can do.
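To illustrate the silent-wrong-answer case, here is a minimal sketch of min/max file pruning. It is a toy model, not Iceberg's actual planner: `DataFile`, `scan_equals`, and the file name are hypothetical names made up for this example, and the only assumption is that a scan skips any file whose recorded bounds exclude the predicate value.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""
    path: str
    rows: list[int]   # values actually stored in the file
    min_stat: int     # lower bound recorded in metadata
    max_stat: int     # upper bound recorded in metadata


def scan_equals(files, value):
    """Return rows equal to value, skipping files whose recorded bounds exclude it."""
    result = []
    for f in files:
        if value < f.min_stat or value > f.max_stat:
            continue  # pruned using metadata alone; the file is never opened
        result.extend(r for r in f.rows if r == value)
    return result


# Same file contents, but the second entry was imported with an understated max.
correct = [DataFile("a.parquet", [1, 5, 9], min_stat=1, max_stat=9)]
wrong = [DataFile("a.parquet", [1, 5, 9], min_stat=1, max_stat=4)]

print(scan_equals(correct, 9))  # [9]
print(scan_equals(wrong, 9))    # [] (no error, just a missing row)
```

The second scan raises nothing and logs nothing; the row is simply absent from the result, which is why unverified stats are more dangerous than a query that fails outright.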

Speaking of stats, another great feature of Iceberg compared to Hive is that we can guarantee that we have stats, and that they're correct. At a minimum, we have to read the file footers to get the stats.
