You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/02 10:42:19 UTC

[GitHub] [iceberg] sumeetgajjar opened a new pull request, #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsException in the test logs

sumeetgajjar opened a new pull request, #4942:
URL: https://github.com/apache/iceberg/pull/4942

   This PR aims at removing the unwanted `AlreadyExistsException` stacktrace from the spark-iceberg test logs.
   
   The test logs for spark tests are filled with `AlreadyExistsException(message:Database default already exists)` stacktrace even when all the tests pass.
   
   These `AlreadyExistsException` exceptions are not from Spark per se but from HMS.
   
   When "CREATE NAMESPACE IF NOT EXISTS default" SQL command is executed in Spark, Spark invokes `hive.createDatabase` command. 
   https://github.com/apache/spark/blob/89fdb8a6fb6a669c458891b3abeba236e64b1e89/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L574
   
   Hive client invokes internally invokes HMS API to create the database. If the DB already exists HMS throws `AlreadyExistsException`. When the `ifNotExist` flag is set to true, the Hive client simply ignores the exception.
   https://github.com/apache/hive/blob/63326ff775206e59547b6b1332e25279e90ef5ee/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L608-L619
   
   The HMS logs this exception to STDERR and for iceberg tests since a standalone HMS is running in the same JVM as that of the test, these logs are part of the info output of the tests.
   
   This generates a lot of noise in the logs and might overshadow an actual exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889671062


##########
spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java:
##########
@@ -73,11 +73,7 @@ public static void startMetastoreAndSpark() {
     SparkTestBase.catalog = (HiveCatalog)
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
-    try {
-      catalog.createNamespace(Namespace.of("default"));
-    } catch (AlreadyExistsException ignored) {
-      // the default namespace already exists. ignore the create error
-    }
+    createNamespace(catalog, Namespace.of("default"));

Review Comment:
   Your answer helped me to understand, why I felt ambiguous about this solution. We are in the same situation here than in the other places, like TestSparkReaderDeletes.java, and there we chose to inline the check.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1158095923

   Hi @pvary @kbendick @rdblue,
   Gentle ping to review the new changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889599094


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkTestBaseWithCatalog.java:
##########
@@ -80,13 +80,24 @@ public SparkTestBaseWithCatalog(String catalogName, String implementation, Map<S
     spark.conf().set("spark.sql.catalog." + catalogName, implementation);
     config.forEach((key, value) -> spark.conf().set("spark.sql.catalog." + catalogName + "." + key, value));
 
-    if (config.get("type").equalsIgnoreCase("hadoop")) {
+    boolean isHadoopCatalog = config.get("type").equalsIgnoreCase("hadoop");
+    if (isHadoopCatalog) {
       spark.conf().set("spark.sql.catalog." + catalogName + ".warehouse", "file:" + warehouse);
     }
 
     this.tableName = (catalogName.equals("spark_catalog") ? "" : catalogName + ".") + "default.table";
 
-    sql("CREATE NAMESPACE IF NOT EXISTS default");
+    // When Hive API is invoked to create Namespace/Database using CREATE IF NOT EXISTS, the standalone HMS logs an
+    // AlreadyExistsException exception in the test logs thereby generating unwanted error logs. Thus checking if a
+    // namespace exists before creating one to keep the test logs noise-free.
+    // HMS is not involved in case HadoopCatalog, thus skipping the check for HadoopCatalog.
+    if (isHadoopCatalog) {
+      sql("CREATE NAMESPACE IF NOT EXISTS default");
+    } else {
+      if (!validationNamespaceCatalog.namespaceExists(Namespace.of("default"))) {

Review Comment:
   Hi @pvary - simplified the namespace creation logic further.
    
   However, that failed `TestNamespaceSQL#testListNamespace` test so had to fix it. The fix actually simplifies the assertion logic in `TestNamespaceSQL#testListNamespace` and removes a conditional as well :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1145171770

   @sumeetgajjar: Could you please check the failures?
   
   I am comfortable with this:
   ```
         Namespace defaultNamespace = Namespace.of("default");
         if (!catalog.namespaceExists(defaultNamespace)) {
           catalog.createNamespace(defaultNamespace);
         }
   ``` 
   
   I am not entirely sure about these:
   ```
       boolean createNamespace = isHadoopCatalog || spark.sql("SHOW NAMESPACES IN " + catalogName)
           .filter("namespace = 'default'")
           .isEmpty();
       if (createNamespace) {
         sql("CREATE NAMESPACE IF NOT EXISTS " + catalogName + ".default");
       }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1198396961

   Hi @rdblue and @pvary,
   A gentle ping to review the latest changes.
   If you feel this change is something that we don't need, please let me know as well, I can close this PR :-)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r890266393


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkTestBaseWithCatalog.java:
##########
@@ -86,7 +86,7 @@ public SparkTestBaseWithCatalog(String catalogName, String implementation, Map<S
 
     this.tableName = (catalogName.equals("spark_catalog") ? "" : catalogName + ".") + "default.table";
 
-    sql("CREATE NAMESPACE IF NOT EXISTS default");
+    createNamespace(validationNamespaceCatalog, Namespace.of("default"));

Review Comment:
   Hi Ryan - thank you for your comment.
   
   > Why is this needed? IF NOT EXISTS should check existence first, right?
   
   Yes - in a way it does perform the check.
   
   When "CREATE NAMESPACE IF NOT EXISTS default" SQL command is executed in Spark, Spark invokes hive.createDatabase command.
   https://github.com/apache/spark/blob/89fdb8a6fb6a669c458891b3abeba236e64b1e89/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L574
   
   Hive client invokes internally invokes HMS API to create the database. If the DB already exists HMS throws AlreadyExistsException. When the ifNotExist flag is set to true, the Hive client simply ignores the exception.
   https://github.com/apache/hive/blob/63326ff775206e59547b6b1332e25279e90ef5ee/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L608-L619
   
   The HMS logs this exception to STDERR and for iceberg tests since a standalone HMS is running in the same JVM as that of the test, these logs are part of the info output of the tests.
   
   This generates a lot of noise in the logs and might overshadow an actual exception.
   
   Thus in order to avoid the `AlreadyExistsException` in the info logs, we are performing an explicit check and using the Catalog APIs instead of the Spark SQL to create a namespace.



##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkTestBaseWithCatalog.java:
##########
@@ -86,7 +86,7 @@ public SparkTestBaseWithCatalog(String catalogName, String implementation, Map<S
 
     this.tableName = (catalogName.equals("spark_catalog") ? "" : catalogName + ".") + "default.table";
 
-    sql("CREATE NAMESPACE IF NOT EXISTS default");
+    createNamespace(validationNamespaceCatalog, Namespace.of("default"));

Review Comment:
   Hi @rdblue - thank you for your comment.
   
   > Why is this needed? IF NOT EXISTS should check existence first, right?
   
   Yes - in a way it does perform the check.
   
   When "CREATE NAMESPACE IF NOT EXISTS default" SQL command is executed in Spark, Spark invokes hive.createDatabase command.
   https://github.com/apache/spark/blob/89fdb8a6fb6a669c458891b3abeba236e64b1e89/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L574
   
   Hive client invokes internally invokes HMS API to create the database. If the DB already exists HMS throws AlreadyExistsException. When the ifNotExist flag is set to true, the Hive client simply ignores the exception.
   https://github.com/apache/hive/blob/63326ff775206e59547b6b1332e25279e90ef5ee/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L608-L619
   
   The HMS logs this exception to STDERR and for iceberg tests since a standalone HMS is running in the same JVM as that of the test, these logs are part of the info output of the tests.
   
   This generates a lot of noise in the logs and might overshadow an actual exception.
   
   Thus in order to avoid the `AlreadyExistsException` in the info logs, we are performing an explicit check and using the Catalog APIs instead of the Spark SQL to create a namespace.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
kbendick commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1145199927

   > @sumeetgajjar: Could you please check the failures?
   > 
   > I am comfortable with this:
   > 
   > ```
   >       Namespace defaultNamespace = Namespace.of("default");
   >       if (!catalog.namespaceExists(defaultNamespace)) {
   >         catalog.createNamespace(defaultNamespace);
   >       }
   > ```
   > 
   > I am not entirely sure about these:
   > 
   > ```
   >     boolean createNamespace = isHadoopCatalog || spark.sql("SHOW NAMESPACES IN " + catalogName)
   >         .filter("namespace = 'default'")
   >         .isEmpty();
   >     if (createNamespace) {
   >       sql("CREATE NAMESPACE IF NOT EXISTS " + catalogName + ".default");
   >     }
   > ```
   
   Ditto. I also have concerns about the additional usage of `spark.sql` (and the general complexity of that part). I also consider this to be a normal-ish log, along the lines of "illegal reflective access" and similar things.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889720509


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkTestBaseWithCatalog.java:
##########
@@ -86,7 +86,7 @@ public SparkTestBaseWithCatalog(String catalogName, String implementation, Map<S
 
     this.tableName = (catalogName.equals("spark_catalog") ? "" : catalogName + ".") + "default.table";
 
-    sql("CREATE NAMESPACE IF NOT EXISTS default");
+    createNamespace(validationNamespaceCatalog, Namespace.of("default"));

Review Comment:
   Why is this needed? `IF NOT EXISTS` should check existence first, right?



##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java:
##########
@@ -134,15 +134,9 @@ public void testListNamespace() {
 
     List<Object[]> namespaces = sql("SHOW NAMESPACES IN %s", catalogName);
 
-    if (isHadoopCatalog) {
-      Assert.assertEquals("Should have 1 namespace", 1, namespaces.size());
-      Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
-      Assert.assertEquals("Should have only db namespace", ImmutableSet.of("db"), namespaceNames);
-    } else {
-      Assert.assertEquals("Should have 2 namespaces", 2, namespaces.size());
-      Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
-      Assert.assertEquals("Should have default and db namespaces", ImmutableSet.of("default", "db"), namespaceNames);
-    }
+    Assert.assertEquals("Should have 2 namespaces", 2, namespaces.size());
+    Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
+    Assert.assertEquals("Should have default and db namespaces", ImmutableSet.of("default", "db"), namespaceNames);

Review Comment:
   What changed this behavior?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1144718581

   Hi @kbendick, @RussellSpitzer, @rdblue, @aokolnychyi 
   Can you please review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1199953835

   @sumeetgajjar, I'm fine with these changes, although I think we still need to find out why changing how a database gets created caused it to exist for Hadoop. Feel free to rebase and we can work on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1376419686

   Thanks for the reviews @rdblue @pvary, really appreciate it.
   
   There was an internal redistribution of tasks and a change of ownership of some GitHub PRs due to internal changes.
   I am closing this PR and moving forward @wypoon or @sririshindra would be going leading the changes suggested in this PR.
   They would pull the public branch associated with this PR, add their commits on top of it and create a new PR to ease the logistics of the transition.
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1144894005

   Is this exception in the logs is soo confusing, so this worth the complexity in the test code? Coming from Hive background seeing this exception seems perfectly okay, but I have 6 years of indoctrination so I am not a representative example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1145161981

   Hi @pvary, thank you for your comments.
   
   > Is this exception in the logs is soo confusing, so this worth the complexity in the test code?
   
   In this case, the refactored test code does add additional checks but I believe those are justifiable given they reduce noise from the test logs. IMO, the less noise we generate during tests the better it is while triaging test failures. Especially spark tests where we also have tons of logs from the spark as well.
   
   > Coming from Hive background seeing this exception seems perfectly okay, but I have 6 years of indoctrination so I am not a representative example.
   
   Maybe, I believe it is subjective. 
   
   I was recently triaging a couple of spark-iceberg test failures in our downstream iceberg branches and the test logs were full of the above exception stacktrace. The actual failures were shadowed by them and it took me a while to narrow down on the right stacktrace.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889614912


##########
spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java:
##########
@@ -73,11 +73,7 @@ public static void startMetastoreAndSpark() {
     SparkTestBase.catalog = (HiveCatalog)
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
-    try {
-      catalog.createNamespace(Namespace.of("default"));
-    } catch (AlreadyExistsException ignored) {
-      // the default namespace already exists. ignore the create error
-    }
+    createNamespace(catalog, Namespace.of("default"));

Review Comment:
   This is really just a question: does it worth to create a twoliner method which is used only 2 places?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r888834181


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkTestBaseWithCatalog.java:
##########
@@ -80,13 +80,24 @@ public SparkTestBaseWithCatalog(String catalogName, String implementation, Map<S
     spark.conf().set("spark.sql.catalog." + catalogName, implementation);
     config.forEach((key, value) -> spark.conf().set("spark.sql.catalog." + catalogName + "." + key, value));
 
-    if (config.get("type").equalsIgnoreCase("hadoop")) {
+    boolean isHadoopCatalog = config.get("type").equalsIgnoreCase("hadoop");
+    if (isHadoopCatalog) {
       spark.conf().set("spark.sql.catalog." + catalogName + ".warehouse", "file:" + warehouse);
     }
 
     this.tableName = (catalogName.equals("spark_catalog") ? "" : catalogName + ".") + "default.table";
 
-    sql("CREATE NAMESPACE IF NOT EXISTS default");
+    // When Hive API is invoked to create Namespace/Database using CREATE IF NOT EXISTS, the standalone HMS logs an
+    // AlreadyExistsException exception in the test logs thereby generating unwanted error logs. Thus checking if a
+    // namespace exists before creating one to keep the test logs noise-free.
+    // HMS is not involved in case HadoopCatalog, thus skipping the check for HadoopCatalog.
+    if (isHadoopCatalog) {
+      sql("CREATE NAMESPACE IF NOT EXISTS default");
+    } else {
+      if (!validationNamespaceCatalog.namespaceExists(Namespace.of("default"))) {

Review Comment:
   Would this work for HadoopCatalog as well? So we can just keep the code more simple?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1145198977

   > Could you please check the failures?
   
   Fixed the failures. 
   
   > I am not entirely sure about these:
   
   @pvary simplified the logic a bit, can you please check again? 
   
   
   Also, can you please approve the workflows?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889666334


##########
spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java:
##########
@@ -73,11 +73,7 @@ public static void startMetastoreAndSpark() {
     SparkTestBase.catalog = (HiveCatalog)
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
-    try {
-      catalog.createNamespace(Namespace.of("default"));
-    } catch (AlreadyExistsException ignored) {
-      // the default namespace already exists. ignore the create error
-    }
+    createNamespace(catalog, Namespace.of("default"));

Review Comment:
   In this case, it would be a good idea to have that method to ensure the namespace is created in the same manner i.e using Catalog methods.
   I'm fine with removing the method if required.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r890276891


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java:
##########
@@ -134,15 +134,9 @@ public void testListNamespace() {
 
     List<Object[]> namespaces = sql("SHOW NAMESPACES IN %s", catalogName);
 
-    if (isHadoopCatalog) {
-      Assert.assertEquals("Should have 1 namespace", 1, namespaces.size());
-      Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
-      Assert.assertEquals("Should have only db namespace", ImmutableSet.of("db"), namespaceNames);
-    } else {
-      Assert.assertEquals("Should have 2 namespaces", 2, namespaces.size());
-      Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
-      Assert.assertEquals("Should have default and db namespaces", ImmutableSet.of("default", "db"), namespaceNames);
-    }
+    Assert.assertEquals("Should have 2 namespaces", 2, namespaces.size());
+    Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
+    Assert.assertEquals("Should have default and db namespaces", ImmutableSet.of("default", "db"), namespaceNames);

Review Comment:
   The assertion checks if the namespace that `sql("CREATE NAMESPACE %s", fullNamespace);` creats is in fact created or not.
   
   Now, instead of having only one namespace i.e. "db" which is created using `CREATE NAMESPACE` stmt we have two; one is "db" and the other is "default" that is created in the constructor of the `SparkTestBaseWithCatalog` class during setup.
   
   Assuming the objective of the assertion was to check if the "db" namespace was created or not and even after the changes that this PR introduces the aim is still met, thus there's no change in behavior. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889671887


##########
spark/v3.0/spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java:
##########
@@ -73,11 +73,7 @@ public static void startMetastoreAndSpark() {
     SparkTestBase.catalog = (HiveCatalog)
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
-    try {
-      catalog.createNamespace(Namespace.of("default"));
-    } catch (AlreadyExistsException ignored) {
-      // the default namespace already exists. ignore the create error
-    }
+    createNamespace(catalog, Namespace.of("default"));

Review Comment:
   > We are in the same situation here than in the other places, like TestSparkReaderDeletes.java, and there we chose to inline the check.
   
   `SparkTestBaseWithCatalog` extends `SparkTestBase`, thus the method can be reused in the child class. 
   However, `TestSparkReaderDeletes` does not fall in the above inheritance hierarchy, thus we had to inline it. 
   
   There also does not exist a TestUtil class for Spark tests or else we could have added the method to the util class and reused it everywhere.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889606806


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java:
##########
@@ -85,7 +85,10 @@ public static void startMetastoreAndSpark() {
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
     try {
-      catalog.createNamespace(Namespace.of("default"));
+      Namespace defaultNamespace = Namespace.of("default");

Review Comment:
   This is test code. I would remove anything which is not strictly needed.
   
   Also getting an AlreadyExistsException after checking for existence should be a bug if there is no concurrency.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889605701


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java:
##########
@@ -85,7 +85,10 @@ public static void startMetastoreAndSpark() {
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
     try {
-      catalog.createNamespace(Namespace.of("default"));
+      Namespace defaultNamespace = Namespace.of("default");

Review Comment:
   In a perfect world no. 
   However, this is just due diligence to avoid any unwanted failures. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r933068130


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java:
##########
@@ -134,15 +134,9 @@ public void testListNamespace() {
 
     List<Object[]> namespaces = sql("SHOW NAMESPACES IN %s", catalogName);
 
-    if (isHadoopCatalog) {
-      Assert.assertEquals("Should have 1 namespace", 1, namespaces.size());
-      Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
-      Assert.assertEquals("Should have only db namespace", ImmutableSet.of("db"), namespaceNames);
-    } else {
-      Assert.assertEquals("Should have 2 namespaces", 2, namespaces.size());
-      Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
-      Assert.assertEquals("Should have default and db namespaces", ImmutableSet.of("default", "db"), namespaceNames);
-    }
+    Assert.assertEquals("Should have 2 namespaces", 2, namespaces.size());
+    Set<String> namespaceNames = namespaces.stream().map(arr -> arr[0].toString()).collect(Collectors.toSet());
+    Assert.assertEquals("Should have default and db namespaces", ImmutableSet.of("default", "db"), namespaceNames);

Review Comment:
   The question is whether the test aims to test the creation using the `CREATE NAMESPACE IF NOT EXISTS default` sql too, or the only goal was to test if the NS is created by using the catalog API.
   
   Someone with more knowledge about the Spark test could be good here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1144720250

   cc: @marton-bod @pvary 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#issuecomment-1145614512

   Hi @pvary @kbendick,
   Thank you for the comments.
   
   > I also consider this to be a normal-ish log, along the lines of "illegal reflective access" and similar things.
   
   I agree that some exceptions (in this case `AlreadyExistsException` and as Kyle mentioned "illegal reflective access") can be considered normal after prolonged exposure to them and we grow oblivious to them. 
   
   However, I still feel if we can avoid them in the first place, then we should. 
   
   > I also have concerns about the additional usage of spark.sql (and the general complexity of that part).
   
   I've removed the use of `spark.sql` and simplified the namespace exists check logic to simply use `catalog.namespaceExists`. Please let me know how you feel about those. 
   
   If it still feels complex, I can revert the changes in `SparkTestBaseWithCatalog` and simply keep the part in `SparkTestBase` that was deemed comfortable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889599094


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkTestBaseWithCatalog.java:
##########
@@ -80,13 +80,24 @@ public SparkTestBaseWithCatalog(String catalogName, String implementation, Map<S
     spark.conf().set("spark.sql.catalog." + catalogName, implementation);
     config.forEach((key, value) -> spark.conf().set("spark.sql.catalog." + catalogName + "." + key, value));
 
-    if (config.get("type").equalsIgnoreCase("hadoop")) {
+    boolean isHadoopCatalog = config.get("type").equalsIgnoreCase("hadoop");
+    if (isHadoopCatalog) {
       spark.conf().set("spark.sql.catalog." + catalogName + ".warehouse", "file:" + warehouse);
     }
 
     this.tableName = (catalogName.equals("spark_catalog") ? "" : catalogName + ".") + "default.table";
 
-    sql("CREATE NAMESPACE IF NOT EXISTS default");
+    // When Hive API is invoked to create Namespace/Database using CREATE IF NOT EXISTS, the standalone HMS logs an
+    // AlreadyExistsException exception in the test logs thereby generating unwanted error logs. Thus checking if a
+    // namespace exists before creating one to keep the test logs noise-free.
+    // HMS is not involved in case HadoopCatalog, thus skipping the check for HadoopCatalog.
+    if (isHadoopCatalog) {
+      sql("CREATE NAMESPACE IF NOT EXISTS default");
+    } else {
+      if (!validationNamespaceCatalog.namespaceExists(Namespace.of("default"))) {

Review Comment:
   Hi @pvary - simplified the namespace creation logic further.
    
   However, that failed `TestNamespaceSQL#testListNamespace` test so had to fix it. The fix actually simplifies the assertion logic in `TestNamespaceSQL#testListNamespace` and removes a conditional as well :)
   
   cc: @kbendick 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889605055


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java:
##########
@@ -85,7 +85,10 @@ public static void startMetastoreAndSpark() {
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
     try {
-      catalog.createNamespace(Namespace.of("default"));
+      Namespace defaultNamespace = Namespace.of("default");

Review Comment:
   Do we need the try-catch after this change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar commented on a diff in pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar commented on code in PR #4942:
URL: https://github.com/apache/iceberg/pull/4942#discussion_r889614169


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java:
##########
@@ -85,7 +85,10 @@ public static void startMetastoreAndSpark() {
         CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hiveConf);
 
     try {
-      catalog.createNamespace(Namespace.of("default"));
+      Namespace defaultNamespace = Namespace.of("default");

Review Comment:
   Noted, removed try-catch while creating namespace in the latest commit :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] sumeetgajjar closed pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs

Posted by GitBox <gi...@apache.org>.
sumeetgajjar closed pull request #4942: [Spark][Test]: Check before creating default namespace to avoid noisy AlreadyExistsExceptions in test logs
URL: https://github.com/apache/iceberg/pull/4942


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org