Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/07 21:48:41 UTC

[GitHub] [iceberg] kbendick opened a new pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

kbendick opened a new pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792


   …er catalog
   
   Allows users to set Hadoop configuration overrides on any Iceberg tables that come from an Iceberg-enabled Spark catalog.
   
   Users specify these configurations the same way they would specify global Hadoop configuration overrides on the Spark session, via the Spark config.
   
   E.g. for a table `foo`, to override a hadoop config property `fs.s3a.max.connections`, a config would be added to the spark session config via `--conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4`.
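   
   For illustration, a minimal sketch of setting the same override programmatically when building the session. The catalog name `foo`, the `local[*]` master, and the warehouse path are assumptions for the example:
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class CatalogHadoopOverrideExample {
     public static void main(String[] args) {
       // Sketch only: "foo" is a hypothetical catalog registered as an Iceberg SparkCatalog.
       SparkSession spark = SparkSession.builder()
           .master("local[*]")
           .config("spark.sql.catalog.foo", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.foo.type", "hadoop")
           .config("spark.sql.catalog.foo.warehouse", "/tmp/warehouse")
           // Per-catalog Hadoop override: applies to every table loaded through "foo".
           .config("spark.sql.catalog.foo.hadoop.fs.s3a.max.connections", "4")
           .getOrCreate();
       spark.stop();
     }
   }
   ```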
   
   For now this only works for Spark catalogs; in the future we should consider making this possible for other engines.
   
   This closes https://github.com/apache/iceberg/issues/2607




[GitHub] [iceberg] RussellSpitzer merged pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer merged pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792


   




[GitHub] [iceberg] RussellSpitzer commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-882892930


   Solves #2607 - Thanks @kbendick! Thanks @pvary, @nastra , @flyrain and @szehon-ho for reviewing!




[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670647285



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,32 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      // These checks are copied from `spark.sessionState().newHadoopConfWithOptions()`, which we

Review comment:
       I was not aware that we lost a `synchronized` call. I will investigate that and add it in.






[GitHub] [iceberg] kbendick commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-877396902


   > I believe this is per catalog, not per table, right? Does it also support table-level settings?
   > 
   > > E.g. for a table foo, to override a hadoop config property fs.s3a.max.connections, a config would be added to the spark session config via --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4.
   
   That's correct. I will update that to mention catalog instead (as there's no way to set different Hadoop configs for individual Iceberg tables within the same catalog).
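   
   To make the granularity concrete, a sketch (catalog names hypothetical): tables that need different Hadoop configs have to live in different catalogs, each carrying its own override.
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class PerCatalogOverrideSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .master("local[*]")
           .config("spark.sql.catalog.cat_a", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.cat_a.type", "hadoop")
           .config("spark.sql.catalog.cat_a.warehouse", "/tmp/wh_a")
           .config("spark.sql.catalog.cat_a.hadoop.fs.s3a.max.connections", "4")
           .config("spark.sql.catalog.cat_b", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.cat_b.type", "hadoop")
           .config("spark.sql.catalog.cat_b.warehouse", "/tmp/wh_b")
           .config("spark.sql.catalog.cat_b.hadoop.fs.s3a.max.connections", "16")
           .getOrCreate();
       // Every table in cat_a sees max.connections=4; every table in cat_b sees 16.
       spark.stop();
     }
   }
   ```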




[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670719456



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -32,12 +33,15 @@
 import org.apache.iceberg.transforms.Transform;
 import org.apache.iceberg.transforms.UnknownTransform;
 import org.apache.iceberg.util.Pair;
+import org.apache.spark.sql.SparkSession;
 import org.apache.spark.util.SerializableConfiguration;
 
 public class SparkUtil {
   private SparkUtil() {
   }
 
+  public static final String ICEBERG_CATALOG_PREFIX = "spark.sql.catalog";

Review comment:
       For the name, I felt like `Spark` was already in the class name, but looking at it further I can see how it's not as clear as what you suggest. Updating.






[GitHub] [iceberg] flyrain commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
flyrain commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-876815245


   I believe this is per catalog, not per table, right? Do we also support table-level settings?
   
   > E.g. for a table foo, to override a hadoop config property fs.s3a.max.connections, a config would be added to the spark session config via --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4.




[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670646476



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -32,12 +33,15 @@
 import org.apache.iceberg.transforms.Transform;
 import org.apache.iceberg.transforms.UnknownTransform;
 import org.apache.iceberg.util.Pair;
+import org.apache.spark.sql.SparkSession;
 import org.apache.spark.util.SerializableConfiguration;
 
 public class SparkUtil {
   private SparkUtil() {
   }
 
+  public static final String ICEBERG_CATALOG_PREFIX = "spark.sql.catalog";

Review comment:
       I can update this to be `private static final` and change the name.






[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667258301



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       Added a check for `k != null` as well. 🙂 






[GitHub] [iceberg] kbendick commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-881113629


   @pvary @marton-bod would you mind taking a look at this, as you were in the original discussion on the issue (and are much more informed on the subject of Hadoop configs than I am 🙂)




[GitHub] [iceberg] flyrain commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667255910



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       Thanks for the test. As for checking whether `k` is null, it's OK to leave it out, but safer to add it.






[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670843623



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,32 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      // These checks are copied from `spark.sessionState().newHadoopConfWithOptions()`, which we

Review comment:
       I added the `synchronized` call as you mentioned, but to be honest I'm not sure it's necessary. As noted in the new code, the `forEach` method is already `synchronized` on itself.






[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r671558620



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,31 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
 
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");

Review comment:
       Please let me know if you think the current format string is better (or if you have any suggestions).
   
   Having spent time looking through this, I'd be happy to try to find a more generic solution for the future, as generally speaking all engines likely have this request / need in some form or another.
   
   Thanks again @pvary!






[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667006996



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I don't think a key can be null here, since it's really just a Map under the hood.






[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667164640



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I can check `k` for null, but it wasn't checked in `newHadoopConfWithOptions`. Possibly because a `scala.Map` cannot have `null` keys?






[GitHub] [iceberg] flyrain commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r666585131



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       We need to check whether `k` is null; otherwise it hits an NPE.






[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667174160



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I was able to put a `null` key into a `scala.collection.mutable.Map[String, String]`.
   ```
   scala> val nullString: String = null
   nullString: String = null
   
   scala> x += nullString -> "5"
   res3: scala.collection.mutable.Map[String,String] = Map(null -> 5)
   ```
   
   However, putting a `null` key into the hadoop configuration throws:
   ```
   scala> val config = spark.sessionState.newHadoopConf
   config: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml
   
   scala> config.set(null, "10")
   java.lang.IllegalArgumentException: Property name must not be null
     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
     at org.apache.hadoop.conf.Configuration.set(Configuration.java:1353)
     at org.apache.hadoop.conf.Configuration.set(Configuration.java:1337)
     ... 47 elided
   ```
   
   I think that `spark.sqlContext().conf().settings()` shouldn't return a `null` key, but I can add a check just in case, if we think it's a good idea.






[GitHub] [iceberg] szehon-ho commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670866690



##########
File path: spark3/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogHadoopOverrides.java
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import java.util.Map;
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.KryoHelpers;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+
+public class TestSparkCatalogHadoopOverrides extends SparkCatalogTestBase {
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        { "testhive", SparkCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) },
+        { "testhadoop", SparkCatalog.class.getName(),
+           ImmutableMap.of(
+            "type", "hadoop",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+           ) },
+        { "spark_catalog", SparkSessionCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) }
+    };
+  }
+
+  public TestSparkCatalogHadoopOverrides(String catalogName,
+                                         String implementation,
+                                         Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE IF NOT EXISTS %s (id bigint) USING iceberg", tableName(tableIdent.name()));
+  }
+
+  @After
+  public void dropTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName(tableIdent.name()));
+  }
+
+  @Test
+  public void testTableFromCatalogHasOverrides() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration conf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = conf.get("fs.s3a.buffer.dir", "/whammies");
+    Assert.assertEquals(
+        "Iceberg tables from spark should have the overridden hadoop configurations from the spark config",
+        "/tmp-overridden", catalogOverrideFromTable);
+  }
+
+  @Test
+  public void ensureRoundTripSerializedTableRetainsHadoopConfig() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration originalConf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = originalConf.get("fs.s3a.buffer.dir", "/whammies");
+    Assert.assertEquals(

Review comment:
       Yeah, I see. It's a bit of a shame that it adds to the build runtime, but fine with me.






[GitHub] [iceberg] kbendick commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-876039804


   @rdblue Could we possibly add this to the 0.12 release track? It's a relatively small change but it would be beneficial to us to have this released with 0.12 for users who might maintain their own environments.
   
   If not possible, no worries.




[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667155175



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I pulled these checks from the `newHadoopConfWithOptions` method (which takes in a `scala.Map`, as opposed to `newHadoopConf`). It checks for `v != null && k != "path" && k != "paths"`. So I simplified it to `v != null && k.startsWith(...)`, as the prefix check already covers `path` and `paths`.
   
   The reason I didn't use that function was to avoid having to convert back and forth between a Java `Map` and a Scala `Map` (which was comparatively quite ugly).
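   
   For context, the lambda body is elided in the quoted diff; below is a sketch of what the full method presumably looks like, based on the checks described in this thread (including the `k != null` guard discussed earlier). Treat it as an illustration, not the merged code.
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.spark.sql.SparkSession;
   
   // Sketch of the override-merging method; names follow the quoted diff.
   public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
     final String hadoopConfCatalogPrefix =
         String.format("%s.%s.%s", "spark.sql.catalog", catalogName, "hadoop.");
     Configuration conf = spark.sessionState().newHadoopConf();
     spark.sqlContext().conf().settings().forEach((k, v) -> {
       if (k != null && v != null && k.startsWith(hadoopConfCatalogPrefix)) {
         // Strip the catalog prefix to recover the plain Hadoop property name.
         conf.set(k.substring(hadoopConfCatalogPrefix.length()), v);
       }
     });
     return conf;
   }
   ```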






[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667156038



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I can leave a comment there if we'd like about where I pulled that from?






[GitHub] [iceberg] pvary commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r671099247



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,31 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
 
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");

Review comment:
       I `"hadoop."` should be a `private static final String`.
   
   Still not fan of using `hadoop.` instead of `override.` as mentioned on the other issue, since we will need to have the similar things for Hive as well for hive configurations too. This again will cause confusion for users who are using Hive and Spark as well.
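   
   For clarity, a sketch of the suggested constant extraction (`HADOOP_CONF_SUFFIX` is a hypothetical name):
   
   ```java
   // Hypothetical names; sketch of pulling both prefix pieces into constants.
   private static final String SPARK_CATALOG_CONF_PREFIX = "spark.sql.catalog";
   private static final String HADOOP_CONF_SUFFIX = "hadoop.";
   
   // Then, inside hadoopConfCatalogOverrides:
   final String hadoopConfCatalogPrefix =
       String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, HADOOP_CONF_SUFFIX);
   ```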
   






[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670805240



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,32 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      // These checks are copied from `spark.sessionState().newHadoopConfWithOptions()`, which we

Review comment:
       So I looked into this further, and you're right that `settings` is an instance of `java.util.Collections$SynchronizedMap`.
   
   But the `forEach` method of `SynchronizedMap` already has a `synchronized` block in its definition.
   
   I tried it with `synchronized` around it and the unit tests still passed, but I'm wondering whether it's necessary, given that the class already synchronizes internally on the same object (`this` from the class definition).
   
   I'm going to look into this further and then follow up with you offline about this.
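   
   A small standalone sketch of the behavior in question: `Collections.synchronizedMap` wrappers use the wrapper itself as the mutex by default, and their `forEach` already locks that mutex, so an outer `synchronized` block on the same map just re-enters the same monitor.
   
   ```java
   import java.util.Collections;
   import java.util.HashMap;
   import java.util.Map;
   
   public class SynchronizedForEachSketch {
     public static void main(String[] args) {
       Map<String, String> settings = Collections.synchronizedMap(new HashMap<>());
       settings.put("spark.sql.catalog.foo.hadoop.fs.s3a.max.connections", "4");
   
       // SynchronizedMap.forEach already holds the map's mutex internally, so this
       // outer block re-enters the same monitor rather than adding new protection.
       synchronized (settings) {
         settings.forEach((k, v) -> System.out.println(k + " = " + v));
       }
     }
   }
   ```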






[GitHub] [iceberg] kbendick edited a comment on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-877396902


   > I believe this is per catalog, not per table, right? Does it also support table-level settings?
   > 
   > > E.g. for a table foo, to override a hadoop config property fs.s3a.max.connections, a config would be added to the spark session config via --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4.
   
   That's correct. I have updated the description to mention catalog instead (as there's no way to set different Hadoop configs for individual Iceberg tables within the same catalog).




[GitHub] [iceberg] flyrain edited a comment on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
flyrain edited a comment on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-876815245


   I believe this is per catalog, not per table, right? Does it also support table-level settings?
   
   > E.g. for a table foo, to override a hadoop config property fs.s3a.max.connections, a config would be added to the spark session config via --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r666579600



##########
File path: spark3/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogHadoopOverrides.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import java.util.Map;
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.KryoHelpers;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+
+public class TestSparkCatalogHadoopOverrides extends SparkCatalogTestBase {
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        { "testhive", SparkCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) },
+        { "testhadoop", SparkCatalog.class.getName(),
+           ImmutableMap.of(
+            "type", "hadoop",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+           ) },
+        { "spark_catalog", SparkSessionCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "parquet-enabled", "true",
+            "cache-enabled", "false", // Spark will delete tables using v1, leaving the cache out of sync

Review comment:
       A nit: remove them since they are not used.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-875957720


   cc @RussellSpitzer @aokolnychyi @flyrain @raptond @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667063660



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       Java's HashMap<String, String> allows a key to be null.
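   
   A small sketch of the asymmetry at play: a `HashMap` tolerates a null key, while Hadoop's `Configuration` rejects one (the exception message matches what `Configuration.set` raises, as shown later in this thread).
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.hadoop.conf.Configuration;
   
   public class NullKeyDemo {
     public static void main(String[] args) {
       Map<String, String> map = new HashMap<>();
       map.put(null, "value"); // fine: HashMap permits a single null key
   
       Configuration conf = new Configuration();
       conf.set(null, "value"); // throws IllegalArgumentException: Property name must not be null
     }
   }
   ```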




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-882892930


   Solves #2607 - Thanks @kbendick! Thanks @pvary, @nastra, @flyrain and @szehon-ho for reviewing!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r671497047



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,31 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
 
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");

Review comment:
       > I "hadoop." should be a private static final String.
   
   Happy to update that.
   
   > Still not fan of using hadoop. instead of override. as mentioned on the other issue, since we will need to have the similar things for Hive as well for hive configurations too. This again will cause confusion for users who are using Hive and Spark as well.
   
   As for using `hadoop.` instead of `override.`, I do think that for Spark users this is the most natural way to do it, as it aligns with Spark's existing `spark.hadoop.*` mechanism, which many Spark users naturally look for.
   
   As soon as we determine a better way to make it more generic across catalogs, I'm happy to champion either deprecation efforts of this, or to assist in any way I can with ensuring that both methods work for a reasonable period of time (whatever is decided).
   
   Unfortunately, for now, I'm not sure we can reach consensus on an `override.` prefix that would work across other catalogs. I'm happy to keep looking and to try other approaches in the source code.
   
   As soon as we find something, I'm happy to help with implementing it, reviewing it, and ensuring the messaging gets out to the community as best as possible.
   
   I would personally prefer to keep this version for Spark, as it correlates nicely with `spark.hadoop.*`, which many Spark users naturally expect. So I would advocate for maintaining both methods, since Spark users already have `spark.hadoop.*` as the way to override their job-level hadoop configurations.
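   
   For context, the two mechanisms side by side (illustrative values; `foo` is a placeholder catalog name):
   
   ```
   # session-wide Hadoop override, the mechanism Spark users already know
   --conf spark.hadoop.fs.s3a.max.connections=100
   
   # catalog-scoped override from this PR, applied only to tables in catalog "foo"
   --conf spark.sql.catalog.foo.hadoop.fs.s3a.max.connections=4
   ```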




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667165896



##########
File path: spark3/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogHadoopOverrides.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import java.util.Map;
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.KryoHelpers;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+
+public class TestSparkCatalogHadoopOverrides extends SparkCatalogTestBase {
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        { "testhive", SparkCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) },
+        { "testhadoop", SparkCatalog.class.getName(),
+           ImmutableMap.of(
+            "type", "hadoop",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+           ) },
+        { "spark_catalog", SparkSessionCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "parquet-enabled", "true",
+            "cache-enabled", "false", // Spark will delete tables using v1, leaving the cache out of sync

Review comment:
       Removed 👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667174160



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I was able to put a `null` key into a `scala.Map[String, String]`.
   ```
   scala> val x = scala.collection.mutable.Map[String, String]()
   x: scala.collection.mutable.Map[String,String] = Map()
   
   scala> var nullString: String = null
   nullString: String = null
   
   scala> x += nullString -> "5"
   res3: scala.collection.mutable.Map[String,String] = Map(null -> 5)
   
   scala> x
   res4: scala.collection.mutable.Map[String,String] = Map(null -> 5)
   ```
   
   However, putting a `null` key into the hadoop configuration throws:
   ```
   scala> var config = spark.sessionState.newHadoopConf
   config: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml
   
   scala> config.set(null, "10")
   java.lang.IllegalArgumentException: Property name must not be null
     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
     at org.apache.hadoop.conf.Configuration.set(Configuration.java:1353)
     at org.apache.hadoop.conf.Configuration.set(Configuration.java:1337)
     ... 47 elided
   ```
   
   I think that `settings` shouldn't return a `null` key, but I can add a check just in case, if we think it's a good idea.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670842978



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,34 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
 
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");
+    final Configuration conf = spark.sessionState().newHadoopConf();
+    // settings is a java.util.Collections.synchronizedMap and needs to be wrapped in `synchronized`.
+    synchronized (spark.sqlContext().conf().settings()) {

Review comment:
       I added `synchronized` as mentioned by @szehon-ho, but looking at the overridden definition of `forEach` for `java.util.Collections.SynchronizedMap`, I don't think this is necessary.
   
   Here's the definition of `forEach`, copied from my IDE. Note that it already synchronizes on itself (which is the `settings()` result here).
   
   ```java
           @Override
           public void forEach(BiConsumer<? super K, ? super V> action) {
               synchronized (mutex) {m.forEach(action);}
           }
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r671501379



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,31 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
 
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");

Review comment:
       Other than this I do not have any concerns with regard to the configuration. I totally understand that in this case we cannot find a good solution, so we just have to choose one.
   I think after moving the prefix to a constant we can merge the PR.
   
   Thanks for the patience!
   Peter 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
flyrain commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r666584697



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       Do we need to check if v is null? It's OK to pass null, right?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r669336838



##########
File path: spark3/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogHadoopOverrides.java
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import java.util.Map;
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.KryoHelpers;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+
+public class TestSparkCatalogHadoopOverrides extends SparkCatalogTestBase {
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        { "testhive", SparkCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) },
+        { "testhadoop", SparkCatalog.class.getName(),
+           ImmutableMap.of(
+            "type", "hadoop",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+           ) },
+        { "spark_catalog", SparkSessionCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) }
+    };
+  }
+
+  public TestSparkCatalogHadoopOverrides(String catalogName,
+                                         String implementation,
+                                         Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE IF NOT EXISTS %s (id bigint) USING iceberg", tableName(tableIdent.name()));
+  }
+
+  @After
+  public void dropTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName(tableIdent.name()));
+  }
+
+  @Test
+  public void testTableFromCatalogHasOverrides() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration conf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = conf.get("fs.s3a.buffer.dir", "/whammies");
+    Assert.assertEquals(
+        "Iceberg tables from spark should have the overridden hadoop configurations from the spark config",
+        "/tmp-overridden", catalogOverrideFromTable);
+  }
+
+  @Test
+  public void ensureRoundTripSerializedTableRetainsHadoopConfig() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration originalConf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = originalConf.get("fs.s3a.buffer.dir", "/whammies");
+    Assert.assertEquals(

Review comment:
       Nit: is there any value in running the same assert again as in the previous test?

##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -32,12 +33,15 @@
 import org.apache.iceberg.transforms.Transform;
 import org.apache.iceberg.transforms.UnknownTransform;
 import org.apache.iceberg.util.Pair;
+import org.apache.spark.sql.SparkSession;
 import org.apache.spark.util.SerializableConfiguration;
 
 public class SparkUtil {
   private SparkUtil() {
   }
 
+  public static final String ICEBERG_CATALOG_PREFIX = "spark.sql.catalog";

Review comment:
       Nit: the name is maybe a bit misleading?  (spark_catalog_conf_prefix?)
   
   Also not sure, but it looks like other constants like this are `private static final`, e.g. in IcebergSource?

##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,32 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      // These checks are copied from `spark.sessionState().newHadoopConfWithOptions()`, which we

Review comment:
       I see we lost the `synchronized { }` call that we would have gotten from `spark.sessionState.newHadoopConf` (via the `getAllConfs` method). Should we use it, since the `SynchronizedMap` expects everyone to synchronize when reading the map? (Especially as this is a public API.)
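   
   For comparison, a hedged sketch of the case where the `SynchronizedMap` contract does require the caller to lock: manual iteration, unlike `forEach`, is not guarded internally.
   
   ```java
   import java.util.Collections;
   import java.util.HashMap;
   import java.util.Map;
   
   public class ManualIterationDemo {
     public static void main(String[] args) {
       Map<String, String> settings = Collections.synchronizedMap(new HashMap<>());
       settings.put("k", "v");
   
       // Per the Collections.synchronizedMap javadoc, iteration over the views
       // must be wrapped by the caller; forEach handles locking itself.
       synchronized (settings) {
         for (Map.Entry<String, String> e : settings.entrySet()) {
           System.out.println(e.getKey() + " = " + e.getValue());
         }
       }
     }
   }
   ```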




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670860151



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,32 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      // These checks are copied from `spark.sessionState().newHadoopConfWithOptions()`, which we

Review comment:
       As mentioned in another thread, @szehon-ho and I looked into it, and the `synchronized` call is not needed here, as the underlying map does it on its own (it just doesn't when the map is used from Scala after converting it with `.asScala`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
nastra commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r668460912



##########
File path: spark3/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogHadoopOverrides.java
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import java.util.Map;
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.KryoHelpers;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+
+public class TestSparkCatalogHadoopOverrides extends SparkCatalogTestBase {
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        { "testhive", SparkCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) },
+        { "testhadoop", SparkCatalog.class.getName(),
+           ImmutableMap.of(
+            "type", "hadoop",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+           ) },
+        { "spark_catalog", SparkSessionCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) }
+    };
+  }
+
+  public TestSparkCatalogHadoopOverrides(String catalogName,
+                                         String implementation,
+                                         Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE IF NOT EXISTS %s (id bigint) USING iceberg", tableName(tableIdent.name()));
+  }
+
+  @After
+  public void dropTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName(tableIdent.name()));
+  }
+
+  @Test
+  public void testTableFromCatalogHasOverrides() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration conf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = conf.get("fs.s3a.buffer.dir", "/whammies");

Review comment:
       nit: maybe worth putting `"fs.s3a.buffer.dir"` into a `static final`

##########
File path: spark3/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogHadoopOverrides.java
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import java.util.Map;
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.KryoHelpers;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+
+public class TestSparkCatalogHadoopOverrides extends SparkCatalogTestBase {
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        { "testhive", SparkCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) },
+        { "testhadoop", SparkCatalog.class.getName(),
+           ImmutableMap.of(
+            "type", "hadoop",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+           ) },
+        { "spark_catalog", SparkSessionCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) }
+    };
+  }
+
+  public TestSparkCatalogHadoopOverrides(String catalogName,
+                                         String implementation,
+                                         Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE IF NOT EXISTS %s (id bigint) USING iceberg", tableName(tableIdent.name()));
+  }
+
+  @After
+  public void dropTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName(tableIdent.name()));
+  }
+
+  @Test
+  public void testTableFromCatalogHasOverrides() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration conf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = conf.get("fs.s3a.buffer.dir", "/whammies");
+    Assert.assertEquals(
+        "Iceberg tables from spark should have the overridden hadoop configurations from the spark config",
+        "/tmp-overridden", catalogOverrideFromTable);

Review comment:
       nit: maybe worth putting `"/tmp-overridden"` in a `static final` and use it everywhere




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667155175



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I pulled these checks from the `newHadoopConfWithOptions` method (which takes in a scala.Map, as opposed to `newHadoopConf`). It checks for `v != null && k != "path" && k != "paths"`, so I simplified it to `v != null && k.startsWith(...)`.
   
   The reason I didn't use that function was to avoid having to convert back and forth between a Java Map and a Scala Map (which was quite ugly, comparatively speaking).
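   
   A hedged restatement of that filtering logic, decoupled from the SparkSession for illustration (the helper name and the prefix-stripping are inferred from the javadoc above, not copied from the PR):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.hadoop.conf.Configuration;
   
   public class OverrideFilterSketch {
     static Configuration applyOverrides(Map<String, String> settings, String catalogName) {
       String prefix = String.format("spark.sql.catalog.%s.hadoop.", catalogName);
       Configuration conf = new Configuration();
       settings.forEach((k, v) -> {
         if (v != null && k != null && k.startsWith(prefix)) {
           conf.set(k.substring(prefix.length()), v); // strip the catalog prefix
         }
       });
       return conf;
     }
   
     public static void main(String[] args) {
       Map<String, String> settings = new HashMap<>();
       settings.put("spark.sql.catalog.foo.hadoop.fs.s3a.max.connections", "4");
       settings.put("spark.sql.warehouse.dir", "/tmp"); // ignored: no catalog prefix
       System.out.println(applyOverrides(settings, "foo").get("fs.s3a.max.connections")); // 4
     }
   }
   ```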




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667257811



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       Ok. Happy to add it. I can't imagine there's any practical performance reason not to, and this function will only be invoked a few times anyway.
   
   Coming from Scala, I'm always scared of null anyways 😜 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670718019



##########
File path: spark3/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogHadoopOverrides.java
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.source;
+
+import java.util.Map;
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.KryoHelpers;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestHelpers;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+
+public class TestSparkCatalogHadoopOverrides extends SparkCatalogTestBase {
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        { "testhive", SparkCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) },
+        { "testhadoop", SparkCatalog.class.getName(),
+           ImmutableMap.of(
+            "type", "hadoop",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+           ) },
+        { "spark_catalog", SparkSessionCatalog.class.getName(),
+          ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "hadoop.fs.s3a.buffer.dir", "/tmp-overridden"
+        ) }
+    };
+  }
+
+  public TestSparkCatalogHadoopOverrides(String catalogName,
+                                         String implementation,
+                                         Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE IF NOT EXISTS %s (id bigint) USING iceberg", tableName(tableIdent.name()));
+  }
+
+  @After
+  public void dropTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName(tableIdent.name()));
+  }
+
+  @Test
+  public void testTableFromCatalogHasOverrides() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration conf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = conf.get("fs.s3a.buffer.dir", "/whammies");
+    Assert.assertEquals(
+        "Iceberg tables from spark should have the overridden hadoop configurations from the spark config",
+        "/tmp-overridden", catalogOverrideFromTable);
+  }
+
+  @Test
+  public void ensureRoundTripSerializedTableRetainsHadoopConfig() throws Exception {
+    Table table = getIcebergTableFromSparkCatalog();
+    Configuration originalConf = ((Configurable) table.io()).getConf();
+    String catalogOverrideFromTable = originalConf.get("fs.s3a.buffer.dir", "/whammies");
+    Assert.assertEquals(

Review comment:
       The only benefit is to be able to determine what is causing this test to fail. Does it fail our preconditions (which are checked in the first test) or does it fail post-serialization?
   
   I can remove one or the other, but I prefer having a test that just tests that the value works, and then a separate test for serialization. It's arguably redundant to have the assertion again, but I do prefer to assert on the initial assumption.
   
   That said, I'm happy to drop whichever one you'd prefer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r665792470



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -387,7 +387,7 @@ public final void initialize(String name, CaseInsensitiveStringMap options) {
     Catalog catalog = buildIcebergCatalog(name, options);
 
     this.catalogName = name;
-    this.tables = new HadoopTables(SparkSession.active().sessionState().newHadoopConf());
+    this.tables = new HadoopTables(SparkUtil.hadoopConfCatalogOverrides(SparkSession.active(), name));

Review comment:
       This allows path-based tables in the same catalog to also get the overrides, as sketched below.
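   
   In effect (a hedged sketch, not verbatim PR code beyond the changed line above), the `HadoopTables` instance used for path-based loads is now built from the per-catalog conf:
   ```
   // Sketch: path-based tables loaded through this catalog pick up the
   // per-catalog hadoop overrides, e.g. fs.s3a.* settings.
   Configuration conf = SparkUtil.hadoopConfCatalogOverrides(SparkSession.active(), name);
   HadoopTables tables = new HadoopTables(conf);
   Table byPath = tables.load("s3a://bucket/warehouse/db/tbl");  // hypothetical location
   ```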





[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667174160



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I was able to put a `null` key into a `scala.Map[String, String]`.
   ```
   scala> var x: scala.collection.mutable.Map[String, String] = scala.collection.mutable.Map()
   x: scala.collection.mutable.Map[String,String] = Map()
   
   scala> val nullString: String = null
   nullString: String = null
   
   scala> x += nullString -> "5"
   res3: scala.collection.mutable.Map[String,String] = Map(null -> 5)
   ```
   
   However, putting a `null` key into the hadoop configuration throws:
   ```
   scala> val config = spark.sessionState.newHadoopConf
   config: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml
   
   scala> config.set(null, "10")
   java.lang.IllegalArgumentException: Property name must not be null
     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
     at org.apache.hadoop.conf.Configuration.set(Configuration.java:1353)
     at org.apache.hadoop.conf.Configuration.set(Configuration.java:1337)
     ... 47 elided
   ```
   
   I think that `spark.sqlContext().conf().settings()` shouldn't return a `null` key, but I can add a check just in case if we think it's a good idea.
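   
   If we did want a belt-and-braces guard (hypothetical — the PR keeps only the `v != null` check), a minimal sketch would skip null keys too:
   ```
   // Hypothetical defensive variant of the filter in hadoopConfCatalogOverrides:
   // also skip null keys, since Configuration.set(null, ...) throws
   // IllegalArgumentException ("Property name must not be null").
   spark.sqlContext().conf().settings().forEach((k, v) -> {
     if (k != null && v != null && k.startsWith(hadoopConfCatalogPrefix)) {
       conf.set(k.substring(hadoopConfCatalogPrefix.length()), v);
     }
   });
   ```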





[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667155175



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I pulled these checks from the `newHadoopConfWithOptions` method (which takes a scala.Map). It checks `v != null && k != "path" && k != "paths"`, so I simplified that to `v != null && k.startsWith(...)`.
   
   The reason I didn't use that function is to avoid converting back and forth between a java Map and a scala Map (which was quite ugly by comparison).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org




[GitHub] [iceberg] kbendick commented on pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#issuecomment-881043381


   I just rebased and fixed the merge conflicts. I've also added a synchronized block, @szehon-ho.



[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r671556664



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,31 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
 
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");

Review comment:
       I admittedly couldn't figure out what to name the `private static final String` for `hadoop.`, so I made the format string `private static final` along with a `private static` helper that builds the full prefix key for a given catalog.
   
   If you can think of a better name, by all means let me know. I considered `SPARK_CATALOG_HADOOP_CONF_PREFIX_OVERRIDE_PART`, but that's terribly long.
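   
   For illustration, a sketch of what that naming looks like (names hypothetical apart from `SPARK_CATALOG_CONF_PREFIX`):
   ```
   private static final String SPARK_CATALOG_CONF_PREFIX = "spark.sql.catalog";
   // Format string producing the hadoop-conf override prefix for one catalog,
   // e.g. "spark.sql.catalog.mycat.hadoop."
   private static final String SPARK_CATALOG_HADOOP_CONF_OVERRIDE_FMT_STR =
       SPARK_CATALOG_CONF_PREFIX + ".%s.hadoop.";
   
   private static String hadoopConfPrefixForCatalog(String catalogName) {
     return String.format(SPARK_CATALOG_HADOOP_CONF_OVERRIDE_FMT_STR, catalogName);
   }
   ```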





[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r667164077



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,30 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      if (v != null && k.startsWith(hadoopConfCatalogPrefix)) {

Review comment:
       I left a comment re: where these checks came from.
   
   It's possible we _do_ want to allow `null` values so users can unset configs that are set on the session's default hadoop configuration. But I'm not sure setting a value to `null` would correctly revert it to the default; I think users would need to explicitly set it to whatever default value they want.
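   
   A small standalone sketch of the `Configuration` semantics in question (assuming plain Hadoop behavior): setting a `null` value is rejected, while `unset` removes the entry so lookups fall back to the caller-supplied default:
   ```
   Configuration conf = new Configuration(false);     // no default resources loaded
   conf.set("my.custom.key", "/custom");              // hypothetical key
   // conf.set("my.custom.key", null) would throw IllegalArgumentException
   conf.unset("my.custom.key");                       // removes the entry entirely
   String v = conf.get("my.custom.key", "/default");  // -> "/default"
   ```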





[GitHub] [iceberg] szehon-ho commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670850657



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,34 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
 
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");
+    final Configuration conf = spark.sessionState().newHadoopConf();
+    // settings is a java.util.Collections.synchronizedMap and needs to be wrapped in `synchronized`.
+    synchronized (spark.sqlContext().conf().settings()) {

Review comment:
       Yeah, removing it looks good to me. Thanks for looking into it.





[GitHub] [iceberg] kbendick commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r670843623



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,32 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      // These checks are copied from `spark.sessionState().newHadoopConfWithOptions()`, which we

Review comment:
       I added the `synchronized` block as you mentioned, but to be honest I'm not sure it's necessary. As noted in the new code, `forEach` on the synchronized map is itself synchronized. The docs say manual synchronization is required when iterating with an iterator, but `forEach` winds up synchronized on the map's own mutex.
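   
   A standalone sketch of the two traversal styles over a `Collections.synchronizedMap` (plain JDK code, not Spark): iterator-based traversal needs manual synchronization, while `forEach` on the wrapper locks internally:
   ```
   import java.util.Collections;
   import java.util.HashMap;
   import java.util.Map;
   
   public class SyncMapDemo {
     public static void main(String[] args) {
       Map<String, String> settings = Collections.synchronizedMap(new HashMap<>());
       settings.put("k", "v");
   
       // Iterator-based traversal: the javadoc requires manual synchronization.
       synchronized (settings) {
         for (Map.Entry<String, String> e : settings.entrySet()) {
           System.out.println(e.getKey() + "=" + e.getValue());
         }
       }
   
       // forEach on the synchronized wrapper locks the same mutex internally,
       // so an explicit synchronized block around it is redundant (though harmless).
       settings.forEach((k, v) -> System.out.println(k + "=" + v));
     }
   }
   ```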





[GitHub] [iceberg] szehon-ho commented on a change in pull request #2792: [SPARK] Allow spark catalogs to have hadoop configuration overrides p…

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r669347860



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -99,4 +103,32 @@ public static void validatePartitionTransforms(PartitionSpec spec) {
       }
     }
   }
+
+  /**
+   * Pulls any Catalog specific overrides for the Hadoop conf from the current SparkSession, which can be
+   * set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows for hadoop configurations to be overridden per catalog, by setting
+   * them on the SQLConf, where the following will add the property "fs.default.name" with value
+   * "hdfs://hanksnamenode:8020" to the catalog's hadoop configuration.
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   * @param spark The current Spark session
+   * @param catalogName Name of the catalog to find overrides for.
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog specific overrides applied.
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", ICEBERG_CATALOG_PREFIX, catalogName, "hadoop.");
+    Configuration conf = spark.sessionState().newHadoopConf();
+    spark.sqlContext().conf().settings().forEach((k, v) -> {
+      // These checks are copied from `spark.sessionState().newHadoopConfWithOptions()`, which we

Review comment:
       I see we lost the `synchronized { }` call that we would have had via the old `spark.sessionState.newHadoopConf` call (through the `getAllConfs` method). Should we keep it, since the backing `SynchronizedMap` object expects everyone to synchronize when iterating the map? (Especially as this is a public API.)



