Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/11 23:04:19 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue #2068: Procedure for adding files to a Table

RussellSpitzer opened a new issue #2068:
URL: https://github.com/apache/iceberg/issues/2068


   Internally we have a command, "Import Data", which allows a user to associate a directory of files with a given Iceberg table. It is generally used when someone wants to add a number of files directly to an Iceberg table without rewriting all the data; they just want the existing files (usually from a Hive table or another format) to be added to the existing table.
   
   I was thinking of adding a similar procedure to Iceberg, if that would be of value to other users.
   
   This is also in the vein of exposing more Iceberg internal functionality, like "addFile", directly to Spark SQL ...
   
   cc @aokolnychyi 




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-774242073


   I think @RussellSpitzer's implementation in PR #2210 is a pretty reasonable example of what we can do. We need to improve the code in `SparkTableUtil` to also perform extra checks while reading the footer, but that seems independent of the PR.
   
   There is one point we should discuss on this thread: should we allow importing a path that forces us to infer partitioning, or should we always assume we import a single partition? More details are in [this](https://github.com/apache/iceberg/pull/2210/files#r571191846) comment.




[GitHub] [iceberg] RussellSpitzer commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759628824


   I should also note a workaround I've given to users trying to do something similar: create a Hive external table referring to the files, then migrate that into an Iceberg table.




[GitHub] [iceberg] RussellSpitzer commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758851376


   IMHO, you should do an explicit delete or drop partition if you want that to happen, but I don't have strong feelings about it. I'm not a big fan of the INSERT OVERWRITE behavior.




[GitHub] [iceberg] jackye1995 commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759207669


   > One of the things I love about Iceberg is that it has a strong specification for metadata and tools to enforce it. Not having all sorts of random software writing raw files to disk in different ways is a feature.
   
   +1 for the comment. Internally we have an operation called bootstrap that serves this purpose, and it is only ever done by a very small set of people who know exactly what is going on. The data quality guaranteed by Iceberg is a very valuable aspect, but it is always tempting to add features that are fast and useful because they take a shortcut.
   
   On the other hand, I do see the value of this, especially for people who want to quickly try Iceberg with an existing set of files. There are also some legitimate use cases where a set of data files cannot be rewritten and has to be added this way. So if we think this is a valuable thing to add, it will definitely need a lot of explicit documentation and warnings. Maybe a post-operation verification could be performed by reading a certain amount of the imported data to verify that the table continues to work. If the verification fails, it is at least a simple command to roll back to the previous version without going too far.
   
   @RussellSpitzer something I do not fully understand: how does the procedure know the partition of each file? Does it read a line of the file to figure it out? Does it assume a Hive layout and try to map the path name into partition tuples?




[GitHub] [iceberg] aokolnychyi edited a comment on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758583064


   I think it is particularly interesting with CREATE TABLE LIKE or similar commands. It is kind of close to IMPORT of Hive partitions. We can follow that and support appending or overwriting all files in a path. If we want, we can make this procedure accept a single file too (but I don't think that is a common use case).
   
   @RussellSpitzer, could you suggest what this procedure should look like? What input and output params will it have?




[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-773014658


   Pull request added in case anyone wants an example to think about




[GitHub] [iceberg] aokolnychyi edited a comment on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758853336


   There is some sense in offering `add_files` and `replace_partitions` (or whatever we want to call them) too.




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-774208295


   I finally found time to get through recent comments. I believe the discussion is moving in the right direction. 
   
   > 1. Add name mapping to the Iceberg spec so that it is well-defined and we have test cases to validate
   > 2. Document how name mappings change when a schema evolves (allows adding aliases)
   
   +1. We have tested how name mappings behave when a schema is evolved (e.g. simple cases like adding columns in the middle). That seems to work fine even when the imported files don't have column ids.
   
   > 3. Make sure that when we import files, there is a name mapping set for the table.
   
   I think we are already doing it in the snapshot and migrate procedures for Spark. If we are to add `add_files`, we can assign a default name mapping too.
   
   > 4. Build correct metadata from imported files based on the name mapping.
   
   We try to respect name mappings while importing files in Spark (not sure about Trino) but we trust the name mapping is correct.
   
   > 5. Identify problems with the name mapping, like files with no readable / mapped fields or incompatible data types
   
   It sounds like we should enhance our import code, which currently reads footers to collect stats, to also perform schema checks?
    




[GitHub] [iceberg] aokolnychyi edited a comment on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758850624


   > I'm also not sure what overwrite would do. Dynamic partition overwrite?
   
   Yeah, the use case for this may be as follows: the user imports a partition, and then more files are added to that partition through a non-Iceberg path. So he/she may want to re-import the partition, overwriting the current set of files in it.
   
   It may be used for snapshot tables or tables to which files are promoted after they are written. For example, we have a customer who writes Parquet files using Java and at some point registers them as a partition.




[GitHub] [iceberg] aokolnychyi closed issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi closed issue #2068:
URL: https://github.com/apache/iceberg/issues/2068


   




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758583064


   I think it is particularly interesting with CREATE TABLE LIKE or similar commands. It is kind of close to IMPORT of Hive partitions. We can follow that and support appending or overwriting all files in a path. If we want, we can make this procedure accept a single file too.
   
   @RussellSpitzer, could you suggest what this procedure should look like? What input and output params will it have?




[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759628824


   I should also note a workaround I've given to users trying to do something similar: create a Hive external table referring to the files, then migrate that into an Iceberg table.




[GitHub] [iceberg] rdblue commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-780992492


   I think that we should only infer partitioning when we know that we are converting from a Hive table with identity-partition columns. Since that is a "well-known" format, it should be somewhat reliable. Random paths in a file system are more risky to parse for table data. I could be convinced otherwise, but it seems like a stretch to match an Iceberg table's partitioning to paths.
   
   +1 to checking file footers before importing. The files should not have IDs in the schemas and we should make sure that the schemas can be converted to something readable using the name mapping.




[GitHub] [iceberg] electrum commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
electrum commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759165160


   This feature sounds troublesome, as it seems to mix two requirements:
   
   * Import existing files without **rewriting** them, because writing and storing multiple copies is expensive.
   * Import existing files without **reading** them, because reading is expensive.
   
   Users constantly run into metadata/schema issues in the Hive world, due to a combination of the Hive metadata model and the fact that they use different tools to write and manage files and metadata. One of the things I love about Iceberg is that it has a strong specification for metadata and tools to enforce it. Not having all sorts of random software writing raw files to disk in different ways is a feature.
   
   I strongly oppose anything that makes it easy for users to create broken tables. Anything that is *"up to the user"* to get right is broken by design in my opinion. Users will always get it wrong. I wouldn't even trust myself to get it right. That's why our software has extensive tests and verification checks at runtime.
   
   Reading data is not particularly expensive. You're presumably converting the data to Iceberg to query it, so query it once up front to make sure it's right. If you hide it behind an advanced, *"go fast flag for people who know what they're doing"*, guess what, everyone wants to go fast and thinks they know what they're doing.
   
   Getting metadata wrong doesn't just cause future queries to fail. If the stats or partitioning are wrong, the queries can silently return wrong answers. In my opinion, that's the worst thing a data system can do.
   
   Speaking of stats, another great feature of Iceberg compared to Hive is that we can guarantee that we have stats, and that they're correct. At a minimum, we have to read the file footers to get the stats.
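   
   For reference, a footer read is cheap compared to scanning the data. A minimal parquet-mr sketch (the path is illustrative and error handling is omitted; this is not the actual import code):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.metadata.ParquetMetadata;
   import org.apache.parquet.hadoop.util.HadoopInputFile;
   import org.apache.parquet.schema.MessageType;
   
   // Open only the footer of an existing file: the row count, file schema, and
   // per-column min/max and null counts are all available without reading any rows.
   try (ParquetFileReader reader = ParquetFileReader.open(
       HadoopInputFile.fromPath(
           new Path("s3://bucket/table/dt=2021-01-11/part-00000.parquet"),
           new Configuration()))) {
     ParquetMetadata footer = reader.getFooter();
     MessageType fileSchema = footer.getFileMetaData().getSchema();
     System.out.println(reader.getRecordCount() + " rows, schema: " + fileSchema);
   
     // Per-row-group, per-column statistics that Iceberg metrics would be built from.
     footer.getBlocks().forEach(block ->
         block.getColumns().forEach(column ->
             System.out.println(column.getPath() + " -> " + column.getStatistics())));
   }
   ```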




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758737314


   Thoughts on this one, @kbendick @rdblue @raptond @RussellSpitzer @holdenk @rymurr @shardulm94?




[GitHub] [iceberg] electrum commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
electrum commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-781040286


   One more thing to add to the spec is that identity columns don’t need to be present in the data file, since that is the case for Hive files. We support this in Trino, but I don’t think it or any other migration scenario is tested.




[GitHub] [iceberg] RussellSpitzer commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-765736742


   I'm going to finish up a conversion of our internal method and post it up for review; we can talk more about safety checks and utility once it's sketched out a bit more, I think.




[GitHub] [iceberg] rdblue commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-773662070


   Just catching up on this thread.
   
   I agree with everything that @electrum said. Strong expectations help us. But we also need to have a way to handle existing data. That comes up all the time. Most existing data is tracked by name.
   
   To support existing data files, we built name mappings so that you can take table data that previously identified columns by name and attach IDs. As long as the files that used name-based schema evolution are properly mapped, the Iceberg table can carry on with id-based resolution without problems. I think this is a reasonable path forward to get schema IDs.
   
   Position-based schema evolution isn't very popular and doesn't work well with nested structures, so I think we should focus on name-based.
   
   I completely agree that we need to read part of each data file. For Parquet, we need to get column stats at a minimum, but we should also validate that at least one column is readable.
   
   I think that means that we have a few things to do to formally support this:
   1. Add name mapping to the Iceberg spec so that it is well-defined and we have test cases to validate
   2. Document how name mappings change when a schema evolves (allows adding aliases)
   3. Make sure that when we import files, there is a name mapping set for the table
   4. Build correct metadata from imported files based on the name mapping
   5. Identify problems with the name mapping, like files with no readable / mapped fields or incompatible data types
   
   One last issue to note is how to get the partition for a data file. The current import code assumes that files are coming from a Hive table layout and converts the path to partition values as Hive would. We will need to make sure that an import procedure has a plan for handling this.
   
   And I should note that only Hive partition paths can be parsed. Iceberg purposely makes no guarantee that partition values can be recovered from paths and considers partition values to path conversion to be one-way. So we may want to have a way to pass the partition tuple. A struct is one option, but that makes it especially hard to use the date transforms. Another option is to import everything at the top level, or to try to infer values from lower/upper bounds in column stats.




[GitHub] [iceberg] RussellSpitzer commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758740890


   > I agree with most of your points, @RussellSpitzer.
   > 
   > I'd probably consider naming it `import_files` or `import_data` or `load_data` and add a flag whether it should overwrite the existing data or not.
   > 
   > ```
   > cat.system.import_data(table => 'db.tbl', path => 'path/to/dir or file', format => 'orc', overwrite => true) 
   > ```
   > 
   > Then `path` can be either a root table location, a partition path or a path to an individual file (if we want to support that).
   
   I was wondering about overwrite: I feel it would be safer to do that as two user operations rather than having this operation delete information, since a user may not know which partitions will be touched in a given import.




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-774209863


   Also, I'd probably limit `add_files` to importing only existing files, not something one should do with files written through Iceberg. The primary use case is probably CREATE TABLE LIKE followed by calling `add_files` for a couple of partitions to try reads/writes, or migrating partition by partition if a table has too many partitions to migrate at once.




[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758704781


   I was thinking something like
   
   **add_files**(**table** => "db.sample", **directory** => "/some/dir/files/", **format** => "parquet")
   
   **Table** = "The Iceberg table we are adding files to"
   **Directory** = "The source directory where the files are located"
   **Format** = "The reader used with these files (parquet, orc, avro ...)"
   
   I believe the output should be something like
   
   | Partition  | Files Added |
   |------------|-------------|
   | "Foo", "1" | 5           |
   | "Foo", "2" | 6           |
   
   This command would basically go through all of the files under the provided directory and add them all directly to the table. The files would not be read or scanned, so it would be up to the user to add only files with correct columns, or to adjust the table column mapping to match.
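   
   If this ends up as a Spark stored procedure, calling it from code might look like the sketch below. The procedure name, parameter names, and output columns follow the proposal above and are not a released API:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   SparkSession spark = SparkSession.builder()
       .appName("add-files-example")
       .getOrCreate();
   
   // Hypothetical call matching the sketch above; 'cat' stands in for an Iceberg
   // catalog configured with the Iceberg SQL extensions.
   Dataset<Row> result = spark.sql(
       "CALL cat.system.add_files("
           + "table => 'db.sample', "
           + "directory => '/some/dir/files/', "
           + "format => 'parquet')");
   
   // Expected output: one row per partition with the number of files added.
   result.show();
   ```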
   
   




[GitHub] [iceberg] RussellSpitzer commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-773014658


   Pull request added in case anyone wants an example to think about




[GitHub] [iceberg] electrum commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
electrum commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759169871


   Another concern is that the migrated files will violate the Iceberg specification, since they won't have the required Iceberg column IDs, and may also be missing the identity partitioning columns. If migrated files are officially supported in Iceberg, then exactly what is allowed and how to read them should be covered by the specification.
   
   We have this problem today for Hive migrated tables. They are, at least in theory, supported in Trino (formerly PrestoSQL), but as a second-class citizen, since there are no tests or specification to confirm it's correct. I only know about this because @Parth-Brahmbhatt originally wrote the connector based on Netflix's internal usage and told me how it's supposed to work. Any future Iceberg implementations are likely to miss this detail.




[GitHub] [iceberg] jackye1995 commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759646229


   > We currently use the SparkTableUtil's listPartition function. It has custom code for parquet, orc, and avro and uses the directory structure to determine which files belong to which partitions. After that it's just a getFile, table.append.appendFile. Internally we also provide an option for specifying a specific partition to import.
   
   Thanks for the explanation; this makes sense to me now. In this case, I think it is also valuable to expose the option to specify a specific partition in this procedure.




[GitHub] [iceberg] rymurr commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
rymurr commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759466345


   I agree with @electrum that a huge strength of Iceberg is its strong specification; I would be reluctant to do anything that could weaken that or give users a footgun with respect to metadata strength. Likewise, I agree that reading the files is probably required to do this safely.
   
   However, we are currently working on our own version of the 'make these files an Iceberg table' function. It sounds like several of these already exist in other places. From the comments, this is being driven by the desire to avoid copying files around, especially on S3. Our use case is the same, and the conversion to Iceberg will be done primarily by users (as opposed to a super-user).
   
   I would be interested in helping to derive/implement a spec that defines the canonical Iceberg approach to importing a set of files without moving/copying/renaming (I guess this is similar to the MIGRATE Spark action?). At the very least this is likely going to have to read the footers and to verify the schema and partition spec. I see a lot of value in this function working (at least partially) across engines so that all users/systems can take advantage of it.
   
   cc @vvellanki




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758850624


   > I'm also not sure what overwrite would do. Dynamic partition overwrite?
   
   Yeah, the use case for this may be as follows: the user imports a partition, and then more files are added to that partition through a non-Iceberg path. So he/she may want to re-import the partition, overwriting the current set of files in it.




[GitHub] [iceberg] kbendick commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759109163


   Actually, thinking about it further, reading just the footers wouldn't handle the case of partition spec updates if the table has several partition specs. So it does seem that users would either need to ensure correctness on their own, or we'd have to parse the values from path strings to at least check that the paths appear to be partitioned in a similar manner, and then leave it up to the user to be sure that all of the data matches the partition it's in.
   
   I think it's a useful option to support, even if it depends on users ensuring the data is correctly partitioned etc.




[GitHub] [iceberg] rdblue commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758829503


   I like the idea of importing data files using a stored procedure. We would need to be careful about partition specs because we do not parse values from path strings in most cases. We do it for importing from Hive tables, but we should not allow it for importing from another Iceberg table.
   
   I'm also not sure what `overwrite` would do. Dynamic partition overwrite?




[GitHub] [iceberg] kbendick commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759107248


   I'm in support of this. I have several Airflow DAGs that write Parquet files with Python, etc., that need to be added to tables, like @aokolnychyi mentioned.
   
   > This command would basically go through all of the files under the provided directory and add them all directly to the table. The files would not be read or scanned, so it would be up to the user to add only files with correct columns, or to adjust the table column mapping to match.
   
   I do somewhat worry about not having the ability to perform checks on the files to ensure that they have the correct columns and abide by the partition spec. Particularly, I would worry about people importing data that is partitioned by, say, `date`, while the table has multiple partition specs on other columns. I don't currently have a strong opinion either way, but it would seem beneficial to have something similar to Spark's `spark.sql.parquet.mergeSchema`, meaning it would be nice to optionally allow the file footers to be read and either update the table schema or error out if the files are incompatible. Though I guess that's already offered via a full import, and most likely the use case for this would be something like importing from a directory that's already partitioned by date.
   
   I know Parquet schema merging is a rather expensive option and not typically used, but there are some clusters I've helped administer where setting it by default is reasonable, because their users have historically run into issues with changing schemas in their Java / Python processes without updating the metastore.
   
   But I suppose this would probably be something that users would run on a regular schedule, such as writing Parquet files from other tools (where the schema doesn't typically change very often), or a one-off thing where hopefully the users know what they're doing. At the least, logging the appropriate warnings would be important.
   
   As long as users can roll back, I don't have a strong opinion about supporting the option to verify schema compatibility. Anybody who is truly concerned should retain a long enough snapshot history to roll back in that case.
   
   I'm also +1 on not deleting files in the same operation, as that does seem likely to cause somebody data loss.




[GitHub] [iceberg] kbendick commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758550459


   This sounds particularly important to users on S3, especially users with older data who do not necessarily want to commit to migrating all of their older files (and having to copy them, which can get quite expensive). And given that there are no renames without copying on S3, many users are hesitant to move large parts of their data lake around.
   
   I think this could help drive adoption of Iceberg for users with current Hive warehouses on S3.




[GitHub] [iceberg] RussellSpitzer commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758704781


   I was thinking something like
   
   **add_files**(**table** => "db.sample", **directory** => "/some/dir/files/", **reader** => "parquet")
   
   **Table** = "The Iceberg table we are adding files to"
   **Directory** = "The source directory where the files are located"
   **Format** = "The reader used with these files (parquet, orc, avro ...)"
   
   I believe the output should be something like
   
   | Partition  | Files Added |
   |------------|-------------|
   | "Foo", "1" | 5           |
   | "Foo", "2" | 6           |
   
   This command would basically go through all of the files under the provided directory and add them all directly to the table. The files would not be read or scanned, so it would be up to the user to add only files with correct columns, or to adjust the table column mapping to match.
   
   






[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758743089


   > I was wondering about overwrite: I feel it would be safer to do that as two user operations rather than having this operation delete information, since a user may not know which partitions will be touched in a given import.
   
   I can go either way. I was coming more from the `LOAD DATA INPATH ... INTO TABLE t` perspective, which is supported by Spark for Hive tables.




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-804653778


   Fixed by #2210.




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758738846


   I'd also be curious to hear thoughts from people who work on other query engines.
   
   cc @openinx @Parth-Brahmbhatt @electrum @jackye1995 




[GitHub] [iceberg] RussellSpitzer commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759616650


   I think one thing to note is that it is always possible for files not to match the Iceberg spec, which makes checking the validity of added files a bit difficult. I definitely think we should end up doing at least a footer read to get metadata, but I'm not sure we can/should require that the schemas match exactly. For example, if a file has additional columns not in the spec, should it be rejected?
   
   In our internal code we just make a mapping based on the Iceberg table (`MappingUtil.create(icebergTable.schema)`) and then apply that.
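   
   For reference, with the public API that step looks roughly like the sketch below (the table location is illustrative, and our internal code differs in details). The mapping is stored as the table's default name mapping so that files without field IDs resolve columns by name:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.TableProperties;
   import org.apache.iceberg.hadoop.HadoopTables;
   import org.apache.iceberg.mapping.MappingUtil;
   import org.apache.iceberg.mapping.NameMapping;
   import org.apache.iceberg.mapping.NameMappingParser;
   
   // Load the table (a path-based HadoopTables table here just for illustration;
   // a catalog-loaded table works the same way).
   Table table = new HadoopTables(new Configuration()).load("/warehouse/db/sample");
   
   // Derive a name mapping from the table's current schema and store it as the
   // table's default mapping, so imported files without Iceberg field IDs can
   // still have their columns resolved by name.
   NameMapping mapping = MappingUtil.create(table.schema());
   table.updateProperties()
       .set(TableProperties.DEFAULT_NAME_MAPPING, NameMappingParser.toJson(mapping))
       .commit();
   ```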
   
   @karuppayya 
   Actually, we were discussing an issue with our current Migrate/Snapshot code that has a similar problem. For example, if you create a file with a column "iD" and create an external Hive table referring to that column as "id", that table can be read by Spark, but when converting such a table to Iceberg we would look for "id" and not "iD", so the column would not be mapped correctly.
   
   @jackye1995 
   We currently use `SparkTableUtil`'s `listPartition` function. It has custom code for Parquet, ORC, and Avro, and uses the directory structure to determine which files belong to which partitions. After that it's just a getFile and `table.append.appendFile`. Internally we also provide an option for specifying a specific partition to import.
   
   I agree that this is a bit of a low-level method, but for systems where a rewrite is impossible or expensive, I think it may be a good way to move files (or remake metadata for them).
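   
   For anyone following along, the core of that flow with the public API looks roughly like this sketch. The path, file size, record count, and partition path are hard-coded here for illustration; in the real code they come from listing the directory and reading footers:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.DataFiles;
   import org.apache.iceberg.FileFormat;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.hadoop.HadoopTables;
   
   Table table = new HadoopTables(new Configuration()).load("/warehouse/db/sample");
   
   // Register an existing Parquet file with the table without rewriting it.
   DataFile dataFile = DataFiles.builder(table.spec())
       .withPath("/some/dir/files/dt=2021-01-11/part-00000.parquet")
       .withFormat(FileFormat.PARQUET)
       .withFileSizeInBytes(134_217_728L)     // from the file system listing
       .withRecordCount(1_000_000L)           // from the Parquet footer
       .withPartitionPath("dt=2021-01-11")    // Hive-style path -> partition values
       .build();
   
   table.newAppend()
       .appendFile(dataFile)
       .commit();
   ```
   
   The footer read discussed above is where the record count and column stats would come from, and it is the natural place to add schema checks.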




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758861786


   We may not need a separate procedure. Just use DELETE FROM with a partition predicate. 
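   
   In other words, the overwrite case can stay two explicit steps. A minimal sketch with the table API (the table location, column, and value are illustrative):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.expressions.Expressions;
   import org.apache.iceberg.hadoop.HadoopTables;
   
   Table table = new HadoopTables(new Configuration()).load("/warehouse/db/sample");
   
   // 1. Drop the files currently registered for the partition; a SQL
   //    "DELETE FROM db.sample WHERE dt = '2021-01-11'" maps to this.
   table.newDelete()
       .deleteFromRowFilter(Expressions.equal("dt", "2021-01-11"))
       .commit();
   
   // 2. Re-import the partition's files, e.g. via the proposed add_files
   //    procedure or by appending DataFiles as sketched earlier in the thread.
   ```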




[GitHub] [iceberg] rymurr edited a comment on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
rymurr edited a comment on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759466345


   I agree with @electrum that a huge strength of Iceberg is its strong specification; I would be reluctant to do anything that could weaken that or give users a footgun with respect to metadata strength. Likewise, I agree that reading the files is probably required to do this safely.
   
   However, we are currently working on our own version of the 'make these files an Iceberg table' function. It sounds like several of these already exist in other places. From the comments, this is being driven by the desire to avoid copying files around, especially on S3. Our use case is the same, and the conversion to Iceberg will be done primarily by users (as opposed to a super-user), though it will likely be facilitated by a Nessie branch so there is a safeguard in case of misuse.
   
   I would be interested in helping to derive/implement a spec that defines the canonical Iceberg approach to importing a set of files without moving/copying/renaming (I guess this is similar to the MIGRATE Spark action?). At the very least this is likely going to have to read the footers and to verify the schema and partition spec. I see a lot of value in this function working (at least partially) across engines so that all users/systems can take advantage of it.
   
   cc @vvellanki




[GitHub] [iceberg] rdblue commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758858143


   +1 for separate operations to remove files. There is no need for an atomic swap when you're working with file lists, so it should be sufficient to delete the data and re-append the files.




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758852713


   I don't feel strongly here, but we will need a solution like that internally. Offering a separate procedure for that is also fine.




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758734891


   I agree with most of your points, @RussellSpitzer.
   
   I'd probably consider naming it `import_files` or `import_data` or `load_data` and add a flag whether it should overwrite the existing data or not.
   
   ```
   cat.system.import_data(table => 'db.tbl', path => 'path/to/dir or file', format => 'orc', overwrite => true) 
   ```
   
   Then `path` can be either a root table location, a partition path or a path to an individual file (if we want to support that).




[GitHub] [iceberg] aokolnychyi commented on issue #2068: Procedure for adding files to a Table

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-758853336


   There is some sense in offering `add_files` and `replace_partitions` too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org