You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/15 16:35:16 UTC

[GitHub] [iceberg] marton-bod opened a new pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

marton-bod opened a new pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752


   Hive serializes the Iceberg table object into each individual split so that the table object can be deserialized on the executor side for efficient reads/writes (also because executors might not be authorized to communicate with the metastore to load the table). 
   
   Since the FileIO is part of the table and it has its own hadoop configuration, this configuration will be the dominant factor determining the size of the serialized split. In our tests we have found that due to this serialized config, iceberg splits are 15-20x larger than normal Hive splits (which led to OOM in some of our perf tests). 
   
   This PR proposes to introduce a config which can turn off this config serialization, and let the deserializer-side fill out the config values instead (which works for Hive executors, since they have all the config values in hand). This can reduce the split size by ~20x based on local tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r770421748



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -210,6 +210,7 @@ public void initialize(InputSplit split, TaskAttemptContext newContext) {
       this.context = newContext;
       Table table = ((IcebergSplit) split).table();
       this.io = table.io();
+      this.io.initialize(conf.getValByRegex(""));

Review comment:
       I'll actually refactor this to use the `setConf()` method of the `HadoopConfigurable`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r773748848



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -165,12 +167,47 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
 
   /**
    * Returns the Table serialized to the configuration based on the table name.
+   * If configuration is missing from the FileIO of the table, it will be populated with the input config.
+   *
    * @param config The configuration used to get the data from
    * @param name The name of the table we need as returned by TableDesc.getTableName()
    * @return The Table
    */
   public static Table table(Configuration config, String name) {
-    return SerializationUtil.deserializeFromBase64(config.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + name));
+    Table table = SerializationUtil.deserializeFromBase64(config.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + name));
+    checkAndSetIoConfig(config, table);
+    return table;
+  }
+
+  /**
+   * If enabled, it populates the FileIO's hadoop configuration with the input config object.
+   * This might be necessary when the table object was serialized without the FileIO config.
+   *
+   * @param config Configuration to set for FileIO, if enabled
+   * @param table The Iceberg table object
+   */
+  public static void checkAndSetIoConfig(Configuration config, Table table) {
+    if (config.getBoolean(InputFormatConfig.CONFIG_SERIALIZATION_DISABLED, false) &&

Review comment:
       Sure, makes sense. Moved it to a default property.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r770431382



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -107,7 +108,13 @@ private String metadataFileLocation(Table table) {
 
   private FileIO fileIO(Table table) {
     if (table.io() instanceof HadoopConfigurable) {
-      ((HadoopConfigurable) table.io()).serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      HadoopConfigurable configurableIo = (HadoopConfigurable) table.io();
+      if ("true".equalsIgnoreCase(configurableIo.getConf().get(ConfigProperties.CONFIG_SERIALIZATION_DISABLED))) {
+        // serialize with an empty Configuration object - configs should be populated on the deserializer-side
+        configurableIo.serializeConfWith(conf -> new SerializableConfiguration(new Configuration(false))::get);

Review comment:
       Good point. Created a new implementation for that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r772517085



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -107,7 +108,13 @@ private String metadataFileLocation(Table table) {
 
   private FileIO fileIO(Table table) {
     if (table.io() instanceof HadoopConfigurable) {
-      ((HadoopConfigurable) table.io()).serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      HadoopConfigurable configurableIo = (HadoopConfigurable) table.io();
+      if (configurableIo.getConf().getBoolean(ConfigProperties.CONFIG_SERIALIZATION_DISABLED, false)) {
+        // do not serialize the Configuration object - configs should be set on the deserializer-side
+        configurableIo.serializeConfWith(conf -> new SkipSerializableConfiguration(conf)::get);
+      } else {
+        configurableIo.serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      }

Review comment:
       That's a good idea, thanks @rdblue! I have refactored the code to use a Hive-side config property and call `serializeConfWith` from the Hive storage handler instead. 
   
   The one thing I've noticed and don't really like is that upon using `SerializationUtil.serializeToBase64(table)`, we enter `SerializableTable.copyOf()/SerializableTable.fileIO()`, which would overwrite any potential custom `SerializableSupplier` we might have set earlier for the table via `serializeConfWith`. I wonder if we might want to provide some customizability for the `SerializableTable.copyOf()` in the future. But that could be addressed in a separate PR, if desired.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-1001741535


   We have 2 approvals. I will wait for one more day in case Ryan is back online and can take another look. Otherwise I think it's good to be merged, thanks for the work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r769807450



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -210,6 +210,7 @@ public void initialize(InputSplit split, TaskAttemptContext newContext) {
       this.context = newContext;
       Table table = ((IcebergSplit) split).table();
       this.io = table.io();
+      this.io.initialize(conf.getValByRegex(""));

Review comment:
       Might worth to check if this regex is faster, or using the iterator interface for Config to create a Map, and adding that to `initialize` is faster




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r770470416



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -386,4 +393,20 @@ public Configuration get() {
       return conf;
     }
   }
+
+  private static class EmptySerializedConfiguration implements Serializable {
+
+    private transient volatile Configuration conf = null;
+
+    public Configuration get() {
+      if (conf == null) {
+        synchronized (this) {

Review comment:
       Is this a valid situation? We might want to throw an exception here, since this will be an invalid configuration object in this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r771543771



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -386,4 +393,21 @@ public Configuration get() {
       return conf;
     }
   }
+
+  private static class SkipSerializableConfiguration implements Serializable {
+
+    private transient volatile Configuration conf;
+
+    SkipSerializableConfiguration(Configuration conf) {
+      this.conf = conf;
+    }
+
+    public Configuration get() {
+      if (conf == null) {
+        throw new IllegalStateException("Configuration was not serialized on purpose but was not set manually either");

Review comment:
       What about logging a warning and then calling `new Configuration()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r772887138



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -244,4 +283,21 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  private static class NonSerializingConfig implements Serializable {
+
+    private transient volatile Configuration conf;

Review comment:
       Do we set the config from multiple threads?
   Do we set the config at all?
   
   I think this could be `final` instead of `volatile`, as we so not set the conf, we create a new `NonSerializingConfig`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r770696281



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -386,4 +393,20 @@ public Configuration get() {
       return conf;
     }
   }
+
+  private static class EmptySerializedConfiguration implements Serializable {
+
+    private transient volatile Configuration conf = null;
+
+    public Configuration get() {
+      if (conf == null) {
+        synchronized (this) {

Review comment:
       No, that's a good point, it should throw an exception. This `get()` method was mistakenly accessed during split generation part, where the manifest files are being read even though the config has already been removed earlier by calling the `SerializableTable.copyOf(table)` method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r772888871



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -107,7 +108,13 @@ private String metadataFileLocation(Table table) {
 
   private FileIO fileIO(Table table) {
     if (table.io() instanceof HadoopConfigurable) {
-      ((HadoopConfigurable) table.io()).serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      HadoopConfigurable configurableIo = (HadoopConfigurable) table.io();
+      if (configurableIo.getConf().getBoolean(ConfigProperties.CONFIG_SERIALIZATION_DISABLED, false)) {
+        // do not serialize the Configuration object - configs should be set on the deserializer-side
+        configurableIo.serializeConfWith(conf -> new SkipSerializableConfiguration(conf)::get);
+      } else {
+        configurableIo.serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      }

Review comment:
       Also setting the configuration with `HadoopFileIO.setConf(Configuration conf)` might lead to a few surprises, as it changes the `SerializableSupplier` implementation behind the scenes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-1002787380


   Looks good now that this is Hive-only. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r769797824



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -107,7 +108,13 @@ private String metadataFileLocation(Table table) {
 
   private FileIO fileIO(Table table) {
     if (table.io() instanceof HadoopConfigurable) {
-      ((HadoopConfigurable) table.io()).serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      HadoopConfigurable configurableIo = (HadoopConfigurable) table.io();
+      if ("true".equalsIgnoreCase(configurableIo.getConf().get(ConfigProperties.CONFIG_SERIALIZATION_DISABLED))) {

Review comment:
       I think we should use: `((HadoopConfigurable) table.io()).getConf().getBoolean()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r773748459



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -244,4 +283,21 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     // this is an exception to the interface documentation, but it's a safe operation to add this property
     props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
   }
+
+  private static class NonSerializingConfig implements Serializable {
+
+    private transient volatile Configuration conf;

Review comment:
       Yep, good point




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-996692420


   Looks good to me now. @rdblue, @jackye1995, I would like to ask your opinion around this before committing.
   Thanks,
   Peter
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r771545816



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -107,7 +108,13 @@ private String metadataFileLocation(Table table) {
 
   private FileIO fileIO(Table table) {
     if (table.io() instanceof HadoopConfigurable) {
-      ((HadoopConfigurable) table.io()).serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      HadoopConfigurable configurableIo = (HadoopConfigurable) table.io();
+      if (configurableIo.getConf().getBoolean(ConfigProperties.CONFIG_SERIALIZATION_DISABLED, false)) {
+        // do not serialize the Configuration object - configs should be set on the deserializer-side
+        configurableIo.serializeConfWith(conf -> new SkipSerializableConfiguration(conf)::get);
+      } else {
+        configurableIo.serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      }

Review comment:
       Is it possible to do this in Hive rather than here? Spark uses its own `SerializableConfiguration` that is compatible with Kryo and calls `serializeConfWith` to customize this. I think ideally that is how Hive would work as well. That avoids needing to have a general Hadoop config property. Instead, we can have one just for Hive where it's okay to set things in the Hadoop conf.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r772521518



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -386,4 +393,21 @@ public Configuration get() {
       return conf;
     }
   }
+
+  private static class SkipSerializableConfiguration implements Serializable {
+
+    private transient volatile Configuration conf;
+
+    SkipSerializableConfiguration(Configuration conf) {
+      this.conf = conf;
+    }
+
+    public Configuration get() {
+      if (conf == null) {
+        throw new IllegalStateException("Configuration was not serialized on purpose but was not set manually either");

Review comment:
       `new Configuration()` would create an empty config object in Hive, without reading all the hive-site.xml values (unlike `new HiveConf()`). Using this empty configuration could then lead to file IO failures or unexpected behaviour. Deciding not to serialize the FileIO config must be followed up by setting the missing config explicitly on the deserializer-side. I would rather throw an exception and catch clients who have failed to do so (and fix those instances), than proceed with creating an empty conf which can then lead to issues.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary edited a comment on pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary edited a comment on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-997361482


   @jackye1995: By my understanding the FileIO could be defined on `Catalog` level:
   https://github.com/apache/iceberg/blob/f68d8d426661efc0d7e5686fe833b573b74eadab/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L87-L88
   
   So it could be perfectly valid to have different `FileIo` for every `Catalog`, so I think we should not override this on the engine side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r773355451



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -165,12 +167,47 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
 
   /**
    * Returns the Table serialized to the configuration based on the table name.
+   * If configuration is missing from the FileIO of the table, it will be populated with the input config.
+   *
    * @param config The configuration used to get the data from
    * @param name The name of the table we need as returned by TableDesc.getTableName()
    * @return The Table
    */
   public static Table table(Configuration config, String name) {
-    return SerializationUtil.deserializeFromBase64(config.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + name));
+    Table table = SerializationUtil.deserializeFromBase64(config.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + name));
+    checkAndSetIoConfig(config, table);
+    return table;
+  }
+
+  /**
+   * If enabled, it populates the FileIO's hadoop configuration with the input config object.
+   * This might be necessary when the table object was serialized without the FileIO config.
+   *
+   * @param config Configuration to set for FileIO, if enabled
+   * @param table The Iceberg table object
+   */
+  public static void checkAndSetIoConfig(Configuration config, Table table) {
+    if (config.getBoolean(InputFormatConfig.CONFIG_SERIALIZATION_DISABLED, false) &&

Review comment:
       nit: false should be moved to be `InputFormatConfig.CONFIG_SERIALIZATION_DISABLED_DEFAULT`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-997361482


   @jackye1995: By my understanding the FileIo could be defined on `Catalog` level:
   https://github.com/apache/iceberg/blob/f68d8d426661efc0d7e5686fe833b573b74eadab/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L87-L88
   
   So it could be perfectly valid to have different `FileIo` for every `Catalog`, so I think we should not override this on the engine side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r770010919



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -107,7 +108,13 @@ private String metadataFileLocation(Table table) {
 
   private FileIO fileIO(Table table) {
     if (table.io() instanceof HadoopConfigurable) {
-      ((HadoopConfigurable) table.io()).serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      HadoopConfigurable configurableIo = (HadoopConfigurable) table.io();
+      if ("true".equalsIgnoreCase(configurableIo.getConf().get(ConfigProperties.CONFIG_SERIALIZATION_DISABLED))) {

Review comment:
       Yep, good idea




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-997030992


   Sounds like what is needed here is not the ability to skip serialization of the config, but to add the ability to set the entire `FileIO` for a `SerializableTable` at worker node, because Hadoop configuration might not be the only thing that could take a lot of serialization space, and it seems like we are adding a special handling just for that.
   
   There is also some security concerns in this approach when you say
   > also because executors might not be authorized to communicate with the metastore to load the table
   
   In Trino we have a singleton `FileIOProvider`, which is responsible for creating `FileIO`s used for worker nodes to avoid these issues. I don't know if similar concern is valid for Hive.
   
   I think it is better if we can allow skipping serialization of the entire FileIO object, add some hooks like `SerializableTable.initializeFileIO(fileIO)`, and let the engine create `FileIO` based on their specific requirements outside the serialization logic.
   
   Any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r769800134



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTable.java
##########
@@ -107,7 +108,13 @@ private String metadataFileLocation(Table table) {
 
   private FileIO fileIO(Table table) {
     if (table.io() instanceof HadoopConfigurable) {
-      ((HadoopConfigurable) table.io()).serializeConfWith(conf -> new SerializableConfiguration(conf)::get);
+      HadoopConfigurable configurableIo = (HadoopConfigurable) table.io();
+      if ("true".equalsIgnoreCase(configurableIo.getConf().get(ConfigProperties.CONFIG_SERIALIZATION_DISABLED))) {
+        // serialize with an empty Configuration object - configs should be populated on the deserializer-side
+        configurableIo.serializeConfWith(conf -> new SerializableConfiguration(new Configuration(false))::get);

Review comment:
       Shouldn't we create a new implementation without the extra magic / code / method calls in `SerializableConfiguration`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Core: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r770431772



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -210,6 +210,7 @@ public void initialize(InputSplit split, TaskAttemptContext newContext) {
       this.context = newContext;
       Table table = ((IcebergSplit) split).table();
       this.io = table.io();
+      this.io.initialize(conf.getValByRegex(""));

Review comment:
       Also moved the setConf() logic into the table deserialization part in the StorageHandler




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary edited a comment on pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary edited a comment on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-997361482


   @jackye1995: By my understanding the `FileIO` could be defined on `Catalog` level:
   https://github.com/apache/iceberg/blob/f68d8d426661efc0d7e5686fe833b573b74eadab/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L87-L88
   
   So it could be perfectly valid to have different `FileIO` for every `Catalog`, so I think we should not override this on the engine side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r772881447



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -232,8 +269,10 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     map.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
 
     if (table instanceof Serializable) {

Review comment:
       Do we need this check?
   I think it was introduced when MetaData tables were not `Serializable`. Now, all of the tables should work with `SerializableTable.copyOf(table)`, so I think we can remove this check




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#discussion_r773748237



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -232,8 +269,10 @@ static void overlayTableProperties(Configuration configuration, TableDesc tableD
     map.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
 
     if (table instanceof Serializable) {

Review comment:
       No, we don't need this anymore. I've removed it, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #3752: Hive: add config for disabling FileIO hadoop configuration serialization

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #3752:
URL: https://github.com/apache/iceberg/pull/3752#issuecomment-999473988


   Thanks for the reviews @jackye1995 , @rdblue and @pvary! 
   Let me know if you have additional thoughts or now this looks good for merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org