You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/04 03:55:30 UTC

[GitHub] [iceberg] rajarshisarkar opened a new pull request #3668: AWS: Remove Glue database validation

rajarshisarkar opened a new pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668


   There seems to be an inconsistency in the public documentation, https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-names. As per the doc, the database name has the following constraints:
   
   - A database name cannot be longer than 255 characters.
   - The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.
   
   However, I was able to create the following databases (in `us-east-1` region):
   
   ```
   aws glue create-database --database-input "{\"Name\":\"tempdb-test\"}"
   aws glue create-database --database-input "{\"Name\":\"tempdb-test$21.asdj123\"}"
   aws glue create-database --database-input "{\"Name\":\"tempdb-test$21.asdj123||\"}"
   aws glue create-database --database-input "{\"Name\":\"tempdb-test$21.asdj123||#^@#^$@$$!@#@!#\"}"
   ```
   
   I am removing the namespace validation for time being (we can do that if we don’t gain any traction). I have reached out to Glue Team for clarification on the public documentation as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rajarshisarkar commented on pull request #3668: AWS: Make the error message of GlueCatalog's isValidIdentifier more detailed

Posted by GitBox <gi...@apache.org>.
rajarshisarkar commented on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-992498341


   Thanks for the inputs, @jackye1995 @kbendick @yyanyy.
   
   I have pushed a change to make the error message more detailed highlighting which part (database or table name) failed the validation. The change also throws a better exception like `ValidationException` instead of `IllegalArgumentException` since it’s too general.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick edited a comment on pull request #3668: AWS: Remove namespace validation from GlueCatalog's isValidIdentifier

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-986402769


   > Glue is just trying to be generic, and the Athena doc requirement is based on Hive specification. I am trying to not remove this validation unless there is an absolute need. Using special characters cause a lot of incompatibility across different compute engines. Given the fact that people are likely going to create new namespaces for Iceberg tables, I think it's better to just advocate people to use normal characters instead of allowing too many variations.
   > 
   > Please let me know what you think, or if you have some specific use case to satisfy.
   
   I would also go off of the stricter requirement. From my knowledge, Glue is a bit more general purpose from it's earliest days. So I'm not surprised that its possibly more permissive in what it allows to be created via CLI etc. But it's unlikely that its just the Validation that's the only blocking issue.
   
   With many catalogs, there are limitations on the namespace / database name (or differences in how they're represented by the catalogs TableOperations).
   
   There are also engine-specific considerations, eg Athena (or Trinodb in the general case).
   
   I'm not at AWS (nor have I ever been affiliated), so I would of course defer to @jackye1995. But wanted to provide further insight as to why the change might not be as simple as removing Precondition checks (especially as the majority of the Trino integration code for Iceberg is hosted in the Trino repo). πŸ™‚


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3668: AWS: Remove namespace validation from GlueCatalog's isValidIdentifier

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-986402769


   > Glue is just trying to be generic, and the Athena doc requirement is based on Hive specification. I am trying to not remove this validation unless there is an absolute need. Using special characters cause a lot of incompatibility across different compute engines. Given the fact that people are likely going to create new namespaces for Iceberg tables, I think it's better to just advocate people to use normal characters instead of allowing too many variations.
   > 
   > Please let me know what you think, or if you have some specific use case to satisfy.
   
   I would also go off of the stricter requirement. From my knowledge, Glue is a bit more general purpose from it's earliest dags. So I'm not surprised that its possibly more permissive in what it allows to be created via CLI etc. But it's unlikely that its just the Validation that's causing the hold up.
   
   With many catalogs, there are limitations on the database name (or differences in how they're represented by the catalogs TableOperations).
   
   There are also engine-specific considerations, eg Athena (or Trinodb in the general case).
   
   I'm not at AWS (nor have I ever been affiliated), so I would of course defer to @jackye1995. But wanted to provide further insight as to why the change might not be as simple as removing Precondition checks. πŸ™‚


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rajarshisarkar commented on pull request #3668: AWS: Remove Glue database validation

Posted by GitBox <gi...@apache.org>.
rajarshisarkar commented on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-985961463


   @jackye1995 @yyanyy Please let me know your thoughts on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on a change in pull request #3668: AWS: Make the error message of GlueCatalog's isValidIdentifier more detailed

Posted by GitBox <gi...@apache.org>.
nastra commented on a change in pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#discussion_r797816549



##########
File path: aws/src/test/java/org/apache/iceberg/aws/glue/TestGlueCatalog.java
##########
@@ -454,4 +455,43 @@ public void testRemoveProperties() {
         .when(glue).updateDatabase(Mockito.any(UpdateDatabaseRequest.class));
     glueCatalog.removeProperties(Namespace.of("db1"), Sets.newHashSet("key"));
   }
+
+  @Test
+  public void testValidIdentifier() {
+    String databaseNameValidationMessage = "Cannot convert namespace %s to Glue database name, because it must be " +
+        "1-252 chars of lowercase letters, numbers, underscore";
+    String tableNameValidationMessage = "Cannot use %s as Glue table name, because it must be 1-255 chars of " +
+        "lowercase letters, numbers, underscore";
+
+    // Positive Test Cases
+    Assert.assertTrue(glueCatalog.isValidIdentifier(TableIdentifier.of("db", "table")));
+    Assert.assertTrue(glueCatalog.isValidIdentifier(TableIdentifier.of("db0123456789", "table0123456789")));
+    Assert.assertTrue(glueCatalog.isValidIdentifier(TableIdentifier.of("db_0123456789", "table_0123456789")));
+
+    // Negative Test Cases
+    AssertHelpers.assertThrows("Database Name Invalid", ValidationException.class,
+        String.format(databaseNameValidationMessage, "DB"), () -> {
+          glueCatalog.isValidIdentifier(TableIdentifier.of("DB", "TABLE"));
+        });
+    AssertHelpers.assertThrows("Database Name Invalid", ValidationException.class,
+        String.format(databaseNameValidationMessage, "test-db"), () -> {
+          glueCatalog.isValidIdentifier(TableIdentifier.of("test-db", "table"));
+        });
+    AssertHelpers.assertThrows("Database Name Invalid", ValidationException.class,
+        String.format(databaseNameValidationMessage, "test db"), () -> {
+          glueCatalog.isValidIdentifier(TableIdentifier.of("test db", "table"));
+        });
+    AssertHelpers.assertThrows("Database and Table Name Invalid", ValidationException.class,
+        String.format(databaseNameValidationMessage, "test db"), () -> {
+          glueCatalog.isValidIdentifier(TableIdentifier.of("test db", "ta ble"));
+        });
+    AssertHelpers.assertThrows("Table Name Invalid", ValidationException.class,
+        String.format(tableNameValidationMessage, "test-table"), () -> {
+          glueCatalog.isValidIdentifier(TableIdentifier.of("db", "test-table"));
+        });
+    AssertHelpers.assertThrows("Table Name Invalid", ValidationException.class,
+        String.format(tableNameValidationMessage, Strings.repeat("a", 256)), () -> {

Review comment:
       should there be a similar test with 253 chars for the namespace part? Also might be worth adding tests for empty namespace/table names




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #3668: AWS: Remove namespace validation from GlueCatalog's isValidIdentifier

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-988200101


   Do we know which engine would encounter issue with a more relaxing database name? I think the point "since at that time the database likely already existed (if not created using Iceberg library)" mentioned in the PR description is the main reason, as otherwise it could be a confusing experience if the user wants to create a table with a preexisting database and we fail validation because of the database name. At least it would be good if we print out exactly which part of the table identifier was violating the rule. 
   
   It definitely would be the best if we can get an answer from Glue team on why we have such inconsistency and the path forward. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #3668: AWS: Remove namespace validation from GlueCatalog's isValidIdentifier

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-986102657


   Glue is just trying to be generic, and the Athena doc requirement is based on Hive specification. I am trying to not remove this validation unless there is an absolute need. Using special characters cause a lot of incompatibility across different compute engines. Given the fact that people are likely going to create new namespaces for Iceberg tables, I think it's better to just advocate people to use normal characters instead of allowing too many variations.
   
   Please let me know what you think, or if you have some specific use case to satisfy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick edited a comment on pull request #3668: AWS: Remove namespace validation from GlueCatalog's isValidIdentifier

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-986402769


   > Glue is just trying to be generic, and the Athena doc requirement is based on Hive specification. I am trying to not remove this validation unless there is an absolute need. Using special characters cause a lot of incompatibility across different compute engines. Given the fact that people are likely going to create new namespaces for Iceberg tables, I think it's better to just advocate people to use normal characters instead of allowing too many variations.
   > 
   > Please let me know what you think, or if you have some specific use case to satisfy.
   
   I would also go off of the stricter requirement. From my knowledge, Glue is a bit more general purpose from it's earliest days. So I'm not surprised that its possibly more permissive in what it allows to be created via CLI etc. But it's unlikely that its just the Validation that's causing the hold up.
   
   With many catalogs, there are limitations on the database name (or differences in how they're represented by the catalogs TableOperations).
   
   There are also engine-specific considerations, eg Athena (or Trinodb in the general case).
   
   I'm not at AWS (nor have I ever been affiliated), so I would of course defer to @jackye1995. But wanted to provide further insight as to why the change might not be as simple as removing Precondition checks. πŸ™‚


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #3668: AWS: Make the error message of GlueCatalog's isValidIdentifier more detailed

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #3668:
URL: https://github.com/apache/iceberg/pull/3668#issuecomment-995283213


   I am a bit hesitated with this change. It changes exception type from `IllegalArgumentException` to `ValidationException`, which will break any system that catches the specific exception type.
   
   I also talked in #3698 , which I will quote here:
   
   > I think that is already tested in `testCreateNamespaceBadName` and `testToDatabaseNameFailure`...But I agree the current `TestGlueCatalog` lacks a lot of features. I am thinking about implementing a proper local Glue client to test the code in unit test properly, let me know if you would like to pick up the task.
   
   I think that is the right way to go about this whole situation, comparing to checking the exact error message.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org