You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/06/14 08:01:36 UTC

[GitHub] [iceberg] rdblue commented on a change in pull request #2678: Core: add HadoopConfigurable interface to serialize custom FileIO

rdblue commented on a change in pull request #2678:
URL: https://github.com/apache/iceberg/pull/2678#discussion_r650445221



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopConfigurable.java
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.util.SerializableSupplier;
+
+/**
+ * An interface that extends the Hadoop {@link Configurable} interface to
+ * offer better serialization support for customizable Iceberg objects
+ * such as {@link org.apache.iceberg.io.FileIO}.
+ * <p>
+ * If an object is serialized and needs to use Hadoop configuration,
+ * it is recommend for the object to implement this interface so that
+ * a serializable supplier of configuration can be provided instead of
+ * an actual Hadoop configuration which is not serializable.

Review comment:
       Nit: why is line wrapping here so different than lines 39-42 below?

##########
File path: core/src/main/java/org/apache/iceberg/util/SerializationUtil.java
##########
@@ -43,6 +45,16 @@ private SerializationUtil() {
     }
   }
 
+  public static byte[] serializeToBytesWithHadoopConfig(Object obj) {

Review comment:
       Why introduce a separate method here rather than supporting `HadoopConfigurable` in `serializeToBytes`? It seems less useful if use of `HadoopConfigurable` isn't automatic and you need to remember to call the right method.

##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopConfigurable.java
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.util.SerializableSupplier;
+
+/**
+ * An interface that extends the Hadoop {@link Configurable} interface to
+ * offer better serialization support for customizable Iceberg objects
+ * such as {@link org.apache.iceberg.io.FileIO}.
+ * <p>
+ * If an object is serialized and needs to use Hadoop configuration,
+ * it is recommend for the object to implement this interface so that
+ * a serializable supplier of configuration can be provided instead of
+ * an actual Hadoop configuration which is not serializable.
+ */
+public interface HadoopConfigurable extends Configurable {
+
+  /**
+   * A serializable object requiring Hadoop configuration in Iceberg must be able to
+   * accept a Hadoop configuration supplier,and use the supplier to get Hadoop configurations.
+   * This ensures that Hadoop configuration can be serialized and passed around works in a distributed environment.
+   * @param confSupplier a serializable supplier of Hadoop configuration
+   */
+  default void setConfSupplier(SerializableSupplier<Configuration> confSupplier) {

Review comment:
       Why is this defaulted? Can't we just leave it without the default to require an implementation in all of the classes this is added to?

##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java
##########
@@ -84,4 +83,9 @@ public void setConf(Configuration conf) {
   public Configuration getConf() {
     return hadoopConf.get();
   }
+
+  @Override
+  public void setConfSupplier(SerializableSupplier<Configuration> confSupplier) {

Review comment:
       What about changing this to `Function<Configuration, SerializableSupplier<Configuration>>` so that the caller doesn't need to call `getConf`? It seems brittle to me that the caller is expected to set the supplier using the same conf that was taken from the `HadoopConfigurable`.

##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -132,7 +132,7 @@
     this.splitLookback = options.get(SparkReadOptions.LOOKBACK).map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get(SparkReadOptions.FILE_OPEN_COST).map(Long::parseLong).orElse(null);
 
-    if (table.io() instanceof HadoopFileIO) {
+    if (table.io() instanceof HadoopConfigurable) {

Review comment:
       I don't think this is correct. The logic in this `if` statement assumes that it can create a `FileSystem` for the table's location if its `FileIO` is a `HadoopFileIO`. That's not necessarily the case if the io is just `HadoopConfigurable` because I might have my own implementation that for some reason uses a Hadoop conf.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org