
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/16 03:05:42 UTC

[GitHub] [iceberg] jackye1995 opened a new issue #2833: Unified syntax for system table names

jackye1995 opened a new issue #2833:
URL: https://github.com/apache/iceberg/issues/2833


   Currently Spark and Flink use the syntax `db.table.system_table_name` to access system tables, whereas Trino uses `db.table$system_table_name`. I also saw on the dev list that Peter is planning to add system table support to Hive. I am also scoping the snapshot tagging feature, which will add more complexity to the table naming scheme. So I think it's a good time to discuss the best syntax going forward.
   
   I remember that in #1144 we realized there is an issue with Spark using dot as the delimiter for the default catalog, and it was never truly fixed. I saw that Ryan suggested using `__` instead. I would like to know everyone's take on this, so that we can provide a more unified experience for all users.
   
   @rdblue @electrum @RussellSpitzer @openinx @pvary 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on issue #2833: Unified syntax for system table names

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #2833:
URL: https://github.com/apache/iceberg/issues/2833#issuecomment-882721921


   @pvary, in Spark the rules are:
   
   1. If the name has one identifier part, then use the current catalog and namespace with the identifier as the table name
   2. If the multi-part identifier does not start with a catalog name, use the current catalog with the identifier's namespace and table name
   3. If the multi-part identifier starts with a catalog name, it is a full identifier. Use the catalog, namespace, and table name from the identifier
   
   Those never produce ambiguity. The trade-off is that if you use a table name like `customers.history` (where `customers` is a table) then Spark will not fill in the current database/schema name for the namespace. Spark would be able to find `customers` and resolve it to `current_catalog.current_namespace.customers` though.
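
The three rules above can be sketched in a few lines of Python. This is only an illustration, not Spark's actual implementation: the names `Resolved` and `resolve`, and the idea of passing the set of registered catalogs explicitly, are assumptions made for the example.

```python
from typing import List, NamedTuple, Set, Tuple


class Resolved(NamedTuple):
    catalog: str
    namespace: Tuple[str, ...]
    table: str


def resolve(parts: List[str], catalogs: Set[str],
            current_catalog: str, current_namespace: List[str]) -> Resolved:
    """Hypothetical helper that mirrors the three rules listed above."""
    if not parts:
        raise ValueError("identifier must have at least one part")
    # Rule 1: a single part uses the current catalog and namespace.
    if len(parts) == 1:
        return Resolved(current_catalog, tuple(current_namespace), parts[0])
    # Rule 3: a leading catalog name makes it a full identifier.
    if parts[0] in catalogs:
        return Resolved(parts[0], tuple(parts[1:-1]), parts[-1])
    # Rule 2: otherwise use the current catalog with the identifier's namespace.
    return Resolved(current_catalog, tuple(parts[:-1]), parts[-1])


# The trade-off: "customers" becomes the namespace, the current one is not filled in.
print(resolve(["customers", "history"], {"prod"}, "spark_catalog", ["default"]))
```

With these rules `customers.history` resolves to namespace `customers` and table `history` in the current catalog, which is why the current namespace is not filled in for that name.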



[GitHub] [iceberg] RussellSpitzer commented on issue #2833: Unified syntax for system table names

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2833:
URL: https://github.com/apache/iceberg/issues/2833#issuecomment-882732111

@pvary Spark looks for the first valid reference to a catalog or database before loading the table. We actually copied the logic here:

https://github.com/apache/iceberg/blob/01393a06c284175edab75de34f48b2bfbd606081/spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java#L80-L79

But you end up checking
```
1. Does the name have a single part?
   a. Load defaultCatalog - defaultDatabase - name
2. Can I treat the first part of the name as a catalog?
   a. Use the catalog, use the last element of the name as the table name, everything else is database
   b. Load firstPart as Catalog, middlePart as Database, lastPart as TableName - if database is empty, use default database
3. Otherwise, use the default catalog
   a. Use last element as table name, everything else as database
   b. If database is empty, use default database

So say we have no table "history"

"customer.history" refers to the metadata table

If you have a database customer and table history and iceberg table customer

"customer.customer.history" refers to the metadata table (customer.history is the other table)

If you have a catalog customer and database customer and table history and table customer

"customer.customer.customer.history" refers to the metadata table "history"

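The checklist and the three examples above can be reproduced with a small Python sketch. It is not the code in `SparkUtil`; `split_name`, `load`, the `METADATA_TYPES` set, and the fallback of re-resolving the name without its last part when no data table matches are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

TableKey = Tuple[str, str, str]  # (catalog, database, table)

# A few metadata table names, used here only as sample data.
METADATA_TYPES = {"history", "snapshots", "files", "manifests"}


def split_name(parts: List[str], catalogs: Set[str],
               default_catalog: str, default_db: str) -> TableKey:
    """Hypothetical helper following the checklist above."""
    if len(parts) == 1:
        return (default_catalog, default_db, parts[0])
    if parts[0] in catalogs:
        catalog, rest = parts[0], parts[1:]
    else:
        catalog, rest = default_catalog, parts
    # Last element is the table name, everything else is the database.
    database = ".".join(rest[:-1]) or default_db
    return (catalog, database, rest[-1])


def load(parts: List[str], tables: Set[TableKey], catalogs: Set[str],
         default_catalog: str = "spark_catalog",
         default_db: str = "default") -> Tuple[TableKey, Optional[str]]:
    """Return (table, metadata_type); metadata_type is None for a data table."""
    key = split_name(parts, catalogs, default_catalog, default_db)
    if key in tables:
        return (key, None)
    # Assumed fallback: treat the last part as a metadata table of the rest.
    if len(parts) > 1 and parts[-1] in METADATA_TYPES:
        base = split_name(parts[:-1], catalogs, default_catalog, default_db)
        if base in tables:
            return (base, parts[-1])
    raise LookupError("table not found: " + ".".join(parts))


# Second example: database customer with tables history and customer.
tables = {("spark_catalog", "customer", "history"),
          ("spark_catalog", "customer", "customer")}
print(load(["customer", "history"], tables, set()))
print(load(["customer", "customer", "history"], tables, set()))
```

In this sketch a data table always wins over a metadata table with the same resolved name, which matches the second example: `customer.history` is the data table and `customer.customer.history` is the metadata table.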

[GitHub] [iceberg] findepi commented on issue #2833: Unified syntax for system table names

Posted by GitBox <gi...@apache.org>.

findepi commented on issue #2833:
URL: https://github.com/apache/iceberg/issues/2833#issuecomment-883148258


   As long as table name is delimited (`"customer.history"`) it's unambiguously an identifier.
   However, when written as `customer.history`, it is a qualified name, and this probably has far reaching implications with the SQL spec, which governs how names are resolved.
   
   @martint do you think it's possible to use dot-separator for system tables and also obey SQL specification?
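
To make the difference between a delimited identifier and a qualified name concrete, here is a minimal Python sketch of splitting a name on dots while respecting double-quoted parts. It is not taken from any engine's parser; `split_qualified_name` is a hypothetical helper, and it only assumes the usual SQL convention that a doubled quote inside a delimited identifier stands for a literal quote.

```python
from typing import List


def split_qualified_name(name: str) -> List[str]:
    """Split on dots outside double quotes; quotes delimit a single part."""
    parts: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == '"':
            # A doubled quote inside a delimited identifier is a literal quote.
            if in_quotes and i + 1 < len(name) and name[i + 1] == '"':
                current.append('"')
                i += 2
                continue
            in_quotes = not in_quotes
        elif ch == "." and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated delimited identifier")
    parts.append("".join(current))
    return parts


print(split_qualified_name('"customer.history"'))  # one part
print(split_qualified_name("customer.history"))    # two parts
```

The delimited form yields a single identifier part, while the undelimited form yields two parts that the engine then has to resolve as a qualified name.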



[GitHub] [iceberg] losipiuk commented on issue #2833: Unified syntax for system table names

Posted by GitBox <gi...@apache.org>.

losipiuk commented on issue #2833:
URL: https://github.com/apache/iceberg/issues/2833#issuecomment-881375493


   cc: @findepi 



[GitHub] [iceberg] pvary commented on issue #2833: Unified syntax for system table names

Posted by GitBox <gi...@apache.org>.

pvary commented on issue #2833:
URL: https://github.com/apache/iceberg/issues/2833#issuecomment-882707266







[GitHub] [iceberg] pvary commented on issue #2833: Unified syntax for system table names

Posted by GitBox <gi...@apache.org>.

pvary commented on issue #2833:
URL: https://github.com/apache/iceberg/issues/2833#issuecomment-883057433


   Thanks @rdblue and @RussellSpitzer for the detailed answers. Let's translate this to Hive, where queries do not support catalogs yet:
   - If we have a single-part identifier, use the default database and return the data table
   - If we have a multi-part identifier, expect it to be a full identifier (no defaults here)
   
   This results in the same algorithm that @marton-bod and I came up with, which I mentioned in my first comment:
   > you have to always provide the db if you want to access the metadata tables
   
   But at least this does not seem so lame anymore 😄
   
   Also it could be simplified to:
   - If we have a 3 part identifier use the last part as a metadata table type, and use the rest as a table identifier
   - If the identifier has 2 or fewer parts do not change the behaviour
   
   Seems like a manageable change to me. 
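
The simplified rule could look roughly like this in Python. This is a sketch only, not Hive code: `parse_hive_name` and the `default` database name are assumptions, and backquoted table names that contain dots are deliberately not handled.

```python
from typing import Optional, Tuple


def parse_hive_name(name: str,
                    default_db: str = "default") -> Tuple[str, str, Optional[str]]:
    """Return (database, table, metadata_table_type) for a dotted name."""
    parts = name.split(".")
    if len(parts) == 3:
        # Last part is the metadata table type, the rest is the table identifier.
        return (parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # Unchanged behaviour: db.table refers to the data table.
        return (parts[0], parts[1], None)
    if len(parts) == 1:
        # Unchanged behaviour: the default database is used.
        return (default_db, parts[0], None)
    raise ValueError("unsupported identifier: " + name)


print(parse_hive_name("db.customers.history"))
print(parse_hive_name("customers.history"))
```

Under this rule `customers.history` is still read as database `customers` and table `history`, so the database always has to be given to reach a metadata table.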



[GitHub] [iceberg] pvary commented on issue #2833: Unified syntax for system table names

Posted by GitBox <gi...@apache.org>.

pvary commented on issue #2833:
URL: https://github.com/apache/iceberg/issues/2833#issuecomment-883039932


   > The easy advice for users is “don’t create tables with dots in the name” which is that something that IIRC the Hive metastore doesn’t allow
   
   AFAIK Hive nowadays allows creating tables with dots in their names, but you have to backquote them. I am not sure about the released versions, and as a user I would be cautious about depending on it, but theoretically we can handle them.

