You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/01 12:02:01 UTC

[GitHub] [iceberg] lcspinter opened a new issue, #4930: Schema evolution on migrated Hive tables

lcspinter opened a new issue, #4930:
URL: https://github.com/apache/iceberg/issues/4930

   I've been playing with schema evolution on migrated Hive tables and I noticed some inconsistencies. Here is what I did
   
   Create a simple hive table
   `CREATE EXTERNAL TABLE customers (id int, first_name string, last_name string) STORED AS PARQUET;`
   Insert one record
   `INSERT INTO customers VALUES (11, 'Lisa', 'Truman');`
   Migrate it to Iceberg
   `ALTER TABLE customers SET TBLPROPERTIES " +
           "('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler')`
   Drop a column
   `ALTER TABLE customers REPLACE COLUMNS (id int, last_name string)`
   Readd the same column
   `ALTER TABLE customers ADD COLUMNS (first_name string)`
   
   Running a select query on the readded column will give back the previously inserted record, which I believe is the expected outcome. 
   
   I've added an additional step to the test scenario. Before dropping the column I renamed it. 
   `ALTER TABLE customers CHANGE COLUMN first_name first_name_1 string`
   
   If I readd the renamed column and run a query on it I get `null` values. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1144793453

   Maybe you should restate the problem because that example is example the resolution of name to column that I would imagine in hive 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1170300896

   @szehon-ho + @karuppayya I know both of you have been looking into this area as well. Basically the issue here is our "ALTER TABLE" commands are populating the default row mapping in a way that resurrects data which was marked as being deleted by previous DDL ops.
   
   Se my post directly above for an example


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] karuppayya commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
karuppayya commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1170349816

   @RussellSpitzer Does how does Iceberg currently handle dropping a column and add the column with same name again, is this scenario similar?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1144794865

   Oh sorry I see, that alter table statement will not actually migrate a table to iceberg as far as I know. It just marks a table as being an iceberg table. Only Spark currently had migrate and snapshot actions for converting a table


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] lcspinter commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
lcspinter commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1144492936

   @RussellSpitzer Thanks for your reply.
   
   > Data in old files will not be read because we consider that data to be associated with the old column which doesn't exist
   Based on my investigations this might be not true. Please check the first part of my example. If I drop a column without renaming it and right after I readd it, I'm able to read the old data. This is what confuses me. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1170378592

   @karuppayya So the issue here is solely for files which do not have fieldIds. These files are read using what we call a default name mapping specified : https://iceberg.apache.org/spec/#column-projection
   
   > Tables may also define a property schema.name-mapping.default with a JSON name mapping containing a list of field mapping objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain
   > 
   > names: A required list of 0 or more names for a field.
   > field-id: An optional Iceberg field ID used when a field’s name is present in names
   > fields: An optional list of field mappings for child field of structs, maps, and lists.
   
   To walk through this whole scenario.
   
   Imagine I have a table `A, B, C` which iceberg will internally note as Fields (0, 1, 2)
   
   I import a file (HiveFile) from hive that contains `A, B, C`. This triggers (at least in some of our actions like Migrate or Snapshot) the creation of a Name Mapping. 
   ```
   { 
     A -> 0,
     B -> 1, 
     C -> 2
     }
   ```
   
   When reading HiveFile I see that I only have column names, no field IDs are present in the File metadata. So I use the mapping specified above to say which columns actually belong to which fields.
   
   A file written by any Iceberg writer (IcebergFile) would contain embedded in it the mapping of this particular file. 
   
   ```
   Footer {
   {  iceberg.schema {
   A -> 0, B -> 1, C -> 2
   }}
   ```
   
   Now when we drop C and ADD C our new Table still has names
   `A, B, C` but fields `A -> 0, B->  1, C->3` 
   
   So for IcebergFile we don't have a problem. Because we look at that file and see it only has field "2" not "3" so it can't possibly have values for our new C (3).
   
   Now the Table's name mapping should have changed here. The C that we originally mapped "C -> 2" was dropped so we should no longer map the column from the Hive table. But I believe our current behavior is to change the name mapping so that the name mapping also contains "C -> 3". This is incorrect


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1369283496

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] lcspinter commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
lcspinter commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1149920747

   @RussellSpitzer Could you please help me understand what should be the correct behaviour? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1386268858

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] closed issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed issue #4930: Schema evolution on migrated Hive tables
URL: https://github.com/apache/iceberg/issues/4930


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] lcspinter commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
lcspinter commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1145796632

   @RussellSpitzer 
   Migration to iceberg is already supported in hive. When I run
   `ALTER TABLE customers SET TBLPROPERTIES " + "('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler')` it will import the files and generate the metadata. Both spark and hive are using the `TableMigrationUtil`to perform the migration.
   
   Nonetheless, the issue is reproducible from spark as well. I wrote a unit test and the symptoms are the same 
   - after table migration to iceberg if I drop and readd a column I'm able to read old data
   - after table migration to iceberg if I rename, drop and readd a column the data is not read back


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1150127669

   I believe the proper behavior should be that neither allows the old data to be read back. Currently I wrote a repo
   
   ```scala
   scala> spark.sql("CREATE external TABLE migratetest (foo int, bar int, zaz int) USING PARQUET LOCATION '/Users/russellspitzer/Temp/migratetest'").show
   
   scala> spark.sql("INSERT INTO migratetest (foo, bar , zaz) VALUES (1, 1, 1)")
   res7: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("call spark_catalog.system.migrate('spark_catalog.default.migratetest')")
   res8: org.apache.spark.sql.DataFrame = [migrated_files_count: bigint]
   
   scala> spark.sql("SELECT * FROM migratetest").show
   +---+---+---+
   |foo|bar|zaz|
   +---+---+---+
   |  1|  1|  1|
   +---+---+---+
   
   scala> spark.sql("ALTER TABLE migratetest DROP COLUMN foo")
   res10: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("SELECT * FROM migratetest").show
   +---+---+
   |bar|zaz|
   +---+---+
   |  1|  1|
   +---+---+
   
   scala> spark.sql("ALTER TABLE migratetest ADD COLUMN foo int")
   res12: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("SELECT * FROM migratetest").show
   +---+---+---+
   |bar|zaz|foo|
   +---+---+---+
   |  1|  1|  1|
   ```
   
   The issue here is that the default name mapping is changed when the second foo column is added, overriding the original name mapping.
   
   Name Mapping in original table : Foo maps to 1
   ```json
   [
    {\n  \"field-id\" : 1,\n  \"names\" : [ \"foo\" ]\n},
    {\n  \"field-id\" : 2,\n  \"names\" : [ \"bar\" ]\n}, 
    {\n  \"field-id\" : 3,\n  \"names\" : [ \"zaz\" ]\n} ]```
   ```
   
   Name Mapping after dropping "foo" : Foo still maps to 1
   ```json
   [ {\n  \"field-id\" : 1,\n  \"names\" : [ \"foo\" ]\n}, 
   {\n  \"field-id\" : 2,\n  \"names\" : [ \"bar\" ]\n}, 
   {\n  \"field-id\" : 3,\n  \"names\" : [ \"zaz\" ]\n} 
   |]
   
   Name Mapping after adding "foo" back : Foo now maps to 4 *This is incorrect we should not be changing the existing mapping*
   ```json
   [ {\n  \"field-id\" : 1,\n  \"names\" : [ ]\n}, 
   {\n  \"field-id\" : 2,\n  \"names\" : [ \"bar\" ]\n},
    {\n  \"field-id\" : 3,\n  \"names\" : [ \"zaz\" ]\n}, 
    {\n  \"field-id\" : 4,\n  \"names\" : [ \"foo\" ]\n} ]
   ```
   
   I'm on vacation now so i'm not going to look into this more, but IMHO that final default name mapping should be identical to the one when dropping the column. So the error here is in the "ADD COLUMN" code.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4930: Schema evolution on migrated Hive tables

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4930:
URL: https://github.com/apache/iceberg/issues/4930#issuecomment-1143665116

   I'm assuming the second example is with an Iceberg table.
   
   Iceberg does not do purely by name resolution, this is one of the advantages over a system like hive. When a column is dropped all of the data in the table for that column at the time is dropped. If another column is added later with the same name it is treated as a brand new column. Data in old files will not be read because we consider that a completely different column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org