You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/04 23:07:07 UTC

[GitHub] [iceberg] rdblue commented on issue #2068: Procedure for adding files to a Table

rdblue commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-773662070


   Just catching up on this thread.
   
   I agree with everything that @electrum said. Strong expectations help us. But we also need to have a way to handle existing data. That comes up all the time. Most existing data is tracked by name.
   
   To support existing data files, we built name mappings so that you can take table data that previously identified columns by name and attach IDs. As long as the files that used name-based schema evolution are properly mapped, the Iceberg table can carry on with id-based resolution without problems. I think this is a reasonable path forward to get schema IDs.
   
   Position-based schema evolution isn't very popular and doesn't work well with nested structures, so I think we should focus on name-based.
   
   I completely agree that we need to read part of each data file. For Parquet, we need to get column stats at a minimum, but we should also validate that at least one column is readable.
   
   I think that means that we have a few things to do to formally support this:
   1. Add name mapping to the Iceberg spec so that it is well-defined and we have test cases to validate
   2. Document how name mappings change when a schema evolves (allows adding aliases)
   3. Make sure that when we import files, there is a name mapping set for the table
   4. Build correct metadata from imported files based on the name mapping
   5. Identify problems with the name mapping, like files with no readable / mapped fields or incompatible data types
   
   One last issue to note is how to get the partition for a data file. The current import code assumes that files are coming from a Hive table layout and converts the path to partition values as Hive would. We will need to make sure that an import procedure has a plan for handling this.
   
   And I should note that only Hive partition paths can be parsed. Iceberg purposely makes no guarantee that partition values can be recovered from paths and considers partition values to path conversion to be one-way. So we may want to have a way to pass the partition tuple. A struct is one option, but that makes it especially hard to use the date transforms. Another option is to import everything at the top level, or to try to infer values from lower/upper bounds in column stats.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org