Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/22 12:42:24 UTC

[GitHub] [iceberg] sshkvar opened a new pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

sshkvar opened a new pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850


   Hi, I would like to add the ability to have a unique table location for each new table in the Hive catalog.
   
   Let me provide some details on why I need it.
   We have two main engines that work with Iceberg tables: Trino and Spark. For storing metadata we use the Hive metastore; data and metadata are stored on S3. We also have a requirement not to drop table data and metadata when dropping a table from the Hive metastore, which allows us to restore any dropped table.
   
   In this PR I added a new Catalog property, `APPEND_UUID_SUFFIX_TO_TABLE_LOCATION`. If this property is set to `true`, a UUID suffix is appended when building the table location on table creation (like s3://bucket/tableName-UUID).
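   
   To illustrate the idea, here is a rough sketch (not the actual PR code); the helper name, the `<warehouse>/<db>/<table>` layout, and the property wiring are assumptions for illustration only:
   ```scala
   import java.util.UUID
   
   // Hypothetical helper: build a table location, optionally suffixed with a UUID
   // when the proposed APPEND_UUID_SUFFIX_TO_TABLE_LOCATION property is enabled.
   def tableLocation(warehouse: String, database: String, table: String, appendUuidSuffix: Boolean): String = {
     val base = s"$warehouse/$database/$table"
     if (appendUuidSuffix) s"$base-${UUID.randomUUID()}" else base
   }
   
   // tableLocation("s3://bucket", "db", "tableName", appendUuidSuffix = true)
   // => s3://bucket/db/tableName-<random UUID>
   ```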
   
   @RussellSpitzer could you please take a look?  
   




[GitHub] [iceberg] pvary edited a comment on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
pvary edited a comment on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886556438


   @sshkvar: I think you could use the following Hive query to achieve the desired results:
   ```
   CREATE EXTERNAL TABLE customers (id INTEGER)
   STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
   LOCATION '<WAREHOUSE>/934b0724-b4da-4202-aa90-4d6e9543b8b4//default/customers';
   ```
   
   Would this work for you?
   Thanks,
   Peter




[GitHub] [iceberg] sshkvar commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886564565


   > @sshkvar: I think you could use the following Hive query to achieve the desired results:
   > 
   > ```
   > CREATE EXTERNAL TABLE customers (id INTEGER)
   > STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
   > LOCATION '<WAREHOUSE>/934b0724-b4da-4202-aa90-4d6e9543b8b4//default/customers';
   > ```
   > 
   > Would this work for you?
   > Thanks,
   > Peter
   
   @pvary yes, this solution can be used, but in this case we would need to change a lot of existing ETLs and ask people to set a custom location for each new table. We would like to have a more general option.
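   
   For context, this is roughly what each job would otherwise have to do by hand (the catalog, schema, and path below are placeholders; `spark` is a SparkSession, e.g. from spark-shell):
   ```scala
   import java.util.UUID
   
   // Every pipeline would need to generate its own unique location and pass it in the DDL.
   val location = s"s3://bucket/warehouse/test_table-${UUID.randomUUID()}"
   spark.sql(s"CREATE TABLE hive_catalog.db.test_table (id bigint) USING iceberg LOCATION '$location'")
   ```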




[GitHub] [iceberg] pvary commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-888925149


   I finally connected this in my head with the same issue in Hive:
   - https://issues.apache.org/jira/browse/HIVE-24445
   
   Wouldn't this solve both issues?




[GitHub] [iceberg] pvary edited a comment on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
pvary edited a comment on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886556438


   @sshkvar: I think you could use the following Hive query to achieve the desired results:
   ```
   CREATE EXTERNAL TABLE customers
   STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
   LOCATION '<WAREHOUSE>/934b0724-b4da-4202-aa90-4d6e9543b8b4//default/customers';
   ```
   
   Would this work for you?
   Thanks,
   Peter




[GitHub] [iceberg] RussellSpitzer commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886779439


   I guess I'm also a little apprehensive about adding more parameters when this could be done in user space pretty easily. But if other folks have this need more frequently, I don't think it's unreasonable. There is actually a similar thing in C*, which always prefixes table names with UUIDs when creating their actual locations to avoid these kinds of conflicts.




[GitHub] [iceberg] sshkvar commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886514901


   > In testing etc, I very often use a similar pattern (possibly using a timestamp as the table suffix).
   > 
   > However, I'm not sure that the best place to do this is in the Iceberg code.
   > 
   > What other tools are you using to create these tables that have UUID suffixes? Usually, when I encounter this need, I'm doing it in one of two places:
   > (1) Directly from shell scripts or small Spark / Trino jobs when testing on S3 (and wanting to ensure a brand new table). The solution for me there is simply to suffix the table name with a timestamp in the code. Here's a sample from some code I have elsewhere:
   > 
   > ```scala
   >     import java.util.Date
   >     val currentTime = new Date().getTime
   >     val tableName = "table_" + currentTime;
   >     spark.sql(s"CREATE TABLE IF NOT EXISTS my_catalog.default.${tableName} (name string, age int) USING iceberg")
   > ```
   > 
   > (2) From some sort of scheduling tool, such as Airflow or Azkaban. In this case, it's very easy to create a UUID when passing in the "new table name" to the Spark job.
   > 
   > Effectively, for me, I'm not sure this is something that makes sense to place in Iceberg.
   > 
   > Can you elaborate further on why this isn't something that you can pass as an argument to your jobs etc? It feels very use case specific, with possible ways for you to deal with it using existing tools, but maybe I'm not fully understanding the scope of your problem. 🙂
   
   @kbendick Thanks for the quick reply.
   Let me provide additional details.
   Actually, we do not need to change the table name (and we don't); this PR just adds a UUID suffix to the table location.
   We need this to store tables with the same name in different "folders" on S3.
   Our use case:
   1. We created a table named `test_table` and inserted some data into it.
   2. Then we dropped this table only from the metastore, because we need the ability to restore it.
   3. Then we again created a new table with the same name, `test_table`.
   4. And again dropped this table.
   With this PR we will be able to restore any of these tables, because their data and metadata are placed in different folders; we just need to restore the table location information in the metastore (we can easily do that via the Iceberg API; a rough sketch is included below).
   
   We also have scheduled compaction and orphan file cleanup processes. If the data and metadata files for both tables live in the same folder, the orphan file cleanup process will delete the data and metadata for the table that was dropped in step 2.
   Based on the above, an `EXTERNAL` table is not an option for us.
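   
   A rough sketch of what such a restore could look like, assuming the catalog exposes `registerTable(identifier, metadataFileLocation)` as Iceberg's Hive catalog does; the identifier and metadata path below are placeholders:
   ```scala
   import org.apache.iceberg.catalog.TableIdentifier
   import org.apache.iceberg.hive.HiveCatalog
   
   // `catalog` is assumed to be an already-configured HiveCatalog instance;
   // the identifier and metadata path are placeholders.
   def restoreDroppedTable(catalog: HiveCatalog): Unit = {
     val id = TableIdentifier.of("db", "test_table")
     // Placeholder: the latest metadata file of the dropped table, still present on S3.
     val metadataFile = "s3://bucket/test_table-<uuid>/metadata/<latest>.metadata.json"
     // registerTable re-creates the metastore entry from an existing metadata file.
     catalog.registerTable(id, metadataFile)
   }
   ```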




[GitHub] [iceberg] sshkvar commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886591554


   > @sshkvar: Do you think this is a general enough request to change the HiveCatalog code, or is it ok for you to create your own CustomCatalog and use it in your deployment?
   
   For me, it can be useful in cases where table data is not deleted.
   But if you would like to avoid these changes, we can try to find other solutions.




[GitHub] [iceberg] kbendick commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886463630


   In testing etc, I very often use a similar pattern (possibly using a timestamp as the table suffix).
   
   However, I'm not sure that the best place to do this is in the Iceberg code.
   
   What other tools are you using to create these tables that have UUID suffixes? Usually, when I encounter this need, I'm doing it in one of two places:
   (1) Directly from shell scripts or small Spark / Trino jobs when testing on S3 (and wanting to ensure a brand new table). The solution for me there is simply to suffix the table name with a timestamp in the code. Here's a sample from some code I have elsewhere:
   ```scala
       import java.util.Date
       val currentTime = new Date().getTime
       val tableName = "table_" + currentTime;
       spark.sql(s"CREATE TABLE IF NOT EXISTS my_catalog.default.${tableName} (name string, age int) USING iceberg")
   ``` 
   (2) From some sort of scheduling tool, such as Airflow or Azkaban. In this case, it's very easy to create a UUID when passing in the "new table name" to the Spark job.
   
   Effectively, for me, I'm not sure this is something that makes sense to place in Iceberg.
   
   Can you elaborate further on why this isn't something that you can pass as an argument to your jobs etc? It feels very use case specific, with possible ways for you to deal with it using existing tools, but maybe I'm not fully understanding the scope of your problem. 🙂 




[GitHub] [iceberg] pvary commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886696199


   > But if you would like to avoid these changes, we can try to find other solutions.
   
   @sshkvar: I would like to hear others' opinions on this too before we come to a decision.




[GitHub] [iceberg] sshkvar commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-895177137


   @RussellSpitzer @kbendick the same functionality has already been merged into the Trino Iceberg connector: https://github.com/trinodb/trino/pull/6063
   Also, as @pvary mentioned, a similar PR is open for Hive: https://issues.apache.org/jira/browse/HIVE-24906
   
   This functionality will allow us to avoid the table rename issue I described above.
   I would really appreciate a review; I am ready to work on it and want to hear your opinion.




[GitHub] [iceberg] sshkvar commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-888868193


   Another important reason why we need a unique table location is the `Rename table` action.
   We have cases where users rename tables and then create new tables with the same name, like:
   1. A user creates a table named `table1`
   2. Then renames it to `table2`
   3. Then creates a new table named `table1`
   
   In this case the data and metadata files of both tables end up in the same 'folder' on S3. When we run a remove orphan files action, for example on `table1`, it will remove data and metadata files belonging to `table2`. A sketch of this sequence is shown below.
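   
   To make the collision concrete, a minimal Spark SQL sketch of the sequence above (the catalog, schema, and table names are placeholders; `spark` is a SparkSession, e.g. from spark-shell):
   ```scala
   // 1. Create table1; the Hive catalog derives its location from the name, e.g. .../db/table1
   spark.sql("CREATE TABLE hive_catalog.db.table1 (id bigint) USING iceberg")
   
   // 2. Rename it; the location does not change, so table2 still lives under .../db/table1
   spark.sql("ALTER TABLE hive_catalog.db.table1 RENAME TO hive_catalog.db.table2")
   
   // 3. Create a new table1; it resolves to the same .../db/table1 location as table2
   spark.sql("CREATE TABLE hive_catalog.db.table1 (id bigint) USING iceberg")
   
   // Running remove orphan files on either table now treats the other table's files as orphans.
   ```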




[GitHub] [iceberg] kbendick commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886464531


   > We also have a requirement not to drop table data and metadata when dropping a table from the Hive metastore, which allows us to restore any dropped table.
   
   I believe this is possible using `EXTERNAL` tables already (where the table is dropped from HMS but the data is retained). Is this not sufficient for your use case?




[GitHub] [iceberg] pvary commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886556438


   @sshkvar: I think you could use the following Hive query to achieve the desired results:
   ```
   CREATE EXTERNAL TABLE customers STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION '<WAREHOUSE>/934b0724-b4da-4202-aa90-4d6e9543b8b4//default/customers';
   ```
   
   Would this work for you?
   Thanks,
   Peter




[GitHub] [iceberg] findepi commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
findepi commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-887415747


   FWIW here is the related change in Trino https://github.com/trinodb/trino/pull/6063
   and related discussion: https://github.com/trinodb/trino/issues/5632#issuecomment-714274998




[GitHub] [iceberg] sshkvar commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-889849331


   This PR adds the ability to have a unique table location.
   In this case each new table will have a unique path on S3.
   With it, the example from my previous message would look like:
   1. A user creates a table named `table1` (location: `s3://iceberg/table1-eecec917-53b8-4d28-a900-c88061fc7ac2`)
   2. Then renames it to `table2` (the location does not change on rename: `s3://iceberg/table1-eecec917-53b8-4d28-a900-c88061fc7ac2`)
   3. Then creates a new table named `table1` (location: `s3://iceberg/table1-309f7903-c4b2-4df1-b890-a481bb47c4c5`)
   
   As we have unique locations, we can safely perform remove orphan files (and compaction) for both tables.
   It looks like a pretty general use case.
   @RussellSpitzer @kbendick what do you think? Does it make sense to add such a property to the Hive catalog?




[GitHub] [iceberg] pvary edited a comment on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
pvary edited a comment on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-888925149


   I finally connected this in my head with the same issue in Hive:
   - https://issues.apache.org/jira/browse/HIVE-24445
   
   Wouldn't this solve both issues?
   
   CC: @deniskuzZ, @zchovan




[GitHub] [iceberg] pvary commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886573458


   @sshkvar: Do you think this is a general enough request to change the HiveCatalog code, or is it ok for you to create your own CustomCatalog and use it in your deployment?

