Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/26 09:01:09 UTC

[GitHub] [iceberg] sshkvar commented on pull request #2850: Spark: Added ability to add uuid suffix to the table location in Hive catalog

sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886514901


   > In testing etc., I very often use a similar pattern (possibly using a timestamp as the table suffix).
   > 
   > However, I'm not sure if the best place to be doing this is in the Iceberg code.
   > 
   > What other tools are you using to create these tables that have UUID suffixes? Usually, when I encounter this need, I'm doing it in one of two places:
   > (1) Directly from shell scripts or small Spark / Trino jobs when testing on S3 (and wanting to ensure a brand new table). The solution for me there is simply to build the table name with a timestamp in the code. Here's a sample from some code I have elsewhere:
   > 
   > ```scala
   > import java.util.Date
   >
   > // A timestamp suffix keeps each test run on a brand new table
   > val currentTime = new Date().getTime
   > val tableName = "table_" + currentTime
   > spark.sql(s"CREATE TABLE IF NOT EXISTS my_catalog.default.${tableName} (name string, age int) USING iceberg")
   > ```
   > 
   > (2) From some sort of scheduling tool, such as Airflow or Azkaban. In this case, it's very easy to create a UUID when passing in the "new table name" to the Spark job.
   > 
   > Effectively, for me, I'm not sure this is something that makes sense to place in Iceberg.
   > 
   > Can you elaborate further on why this isn't something that you can pass as an argument to your jobs etc? It feels very use case specific, with possible ways for you to deal with it using existing tools, but maybe I'm not fully understanding the scope of your problem. 🙂
   
   @kbendick Thanks for the quick reply!
   Let me provide additional details.
   Actually, we do not need to change the table name (and we don't); this PR just adds a UUID suffix to the table location.
   We need this so that tables with the same name are stored in different "folders" on S3, roughly as sketched below.
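   A minimal illustration of the idea (not the PR's actual code; the helper and the warehouse path are hypothetical): with a UUID appended to the default location, two tables created under the same name never share a path.

   ```scala
   import java.util.UUID

   // Hypothetical helper, for illustration only: derive a per-creation
   // table location so a recreated table never collides with a dropped one.
   def defaultTableLocation(warehouse: String, db: String, table: String): String =
     s"$warehouse/${db}.db/${table}-${UUID.randomUUID()}"

   // First create -> s3://bucket/warehouse/default.db/test_table-4f1c... (example)
   // Recreate     -> s3://bucket/warehouse/default.db/test_table-9a2e... (example)
   ```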
   Our use case:
   1. We created a table named `test_table` and inserted some data into it.
   2. Then we dropped this table from the metastore only, because we need the ability to restore it.
   3. Then we created a new table with the same name, `test_table`.
   4. And dropped this table again.
   With this PR we will be able to restore any of these tables, because their data and metadata are placed in different folders; we just need to restore the table location information in the metastore (we can easily do that via the Iceberg API, roughly as sketched below).
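   A minimal restore sketch (assuming `Catalog.registerTable`, which current Iceberg releases expose; the metastore URI, warehouse, and metadata file path below are placeholders):

   ```scala
   import java.util.{HashMap => JHashMap}
   import org.apache.iceberg.catalog.TableIdentifier
   import org.apache.iceberg.hive.HiveCatalog

   // Placeholder catalog configuration -- substitute real values.
   val props = new JHashMap[String, String]()
   props.put("uri", "thrift://metastore:9083")
   props.put("warehouse", "s3://bucket/warehouse")

   val catalog = new HiveCatalog()
   catalog.initialize("hive", props)

   // Re-register the dropped table by pointing the metastore back at the
   // metadata JSON of the version we want to restore (placeholder path).
   catalog.registerTable(
     TableIdentifier.of("default", "test_table"),
     "s3://bucket/warehouse/default.db/test_table-<uuid>/metadata/00002-example.metadata.json")
   ```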
   
   Also, we have scheduled compaction and orphan file cleanup processes. If the data and metadata files of both tables lived in the same folder, the orphan file cleanup process would delete the data and metadata of the table that was dropped in step 2, since nothing in the live table's metadata references them.
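   For example, a hedged sketch of such a cleanup job using Iceberg's Spark actions API (the catalog and table names are placeholders, and `spark` is assumed to be the active SparkSession):

   ```scala
   import org.apache.iceberg.spark.Spark3Util
   import org.apache.iceberg.spark.actions.SparkActions

   // deleteOrphanFiles scans the table location and removes files not
   // referenced by table metadata -- which is exactly why a dropped table
   // sharing that location would lose its files.
   val table = Spark3Util.loadIcebergTable(spark, "my_catalog.default.test_table")
   SparkActions.get()
     .deleteOrphanFiles(table)
     .olderThan(System.currentTimeMillis() - 3L * 24 * 60 * 60 * 1000) // keep last 3 days
     .execute()
   ```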
   Based on what is described above, an `EXTERNAL` table is not an option for us.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


