You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/01 01:49:17 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

aokolnychyi opened a new pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403


   This PR adds reconsiders our table serialization to avoid sending extra requests to access frequently needed metadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializableTable

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r607336166



##########
File path: spark/src/test/java/org/apache/iceberg/KryoHelpers.java
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import org.apache.spark.SparkConf;
+import org.apache.spark.serializer.KryoSerializer;
+
+public class KryoHelpers {

Review comment:
       sounds good. thx a lot for the context




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606970032



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       We have two table implementations: one for base and one for metadata tables. I don't think making those two classes public is a good idea. We need to do a switch somewhere and expose a method for building serializable tables.
   
   While we could name the class like `SerializableTable` and do a switch there, it may be a bit weird that it does not implement `Table`. Let me think more about this. Maybe, that's fine.
   
   ```
   SerializableTable.copyOf(table)
   SerializableTables.copyOf(table)
   SerializableTableFactory.copyOf(table)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-812807604


   Since I redesigned the original approach, we may consider making serialized table classes non-public and offering a factory or a util class to build them instead. We will need to construct serialized tables manually for Kryo.
   
   ```
   SerializedTables.newSerializedTable(table)
   SerializedTables.createSerializedTable(table)
   SerializedTables.forTable(table)
   ```
   
   I don't want to call it a util as it has to be in `org.apache.iceberg` package.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605787928



##########
File path: core/src/main/java/org/apache/iceberg/SortOrderParser.java
##########
@@ -50,6 +52,26 @@ public static void toJson(SortOrder sortOrder, JsonGenerator generator) throws I
     generator.writeEndObject();
   }
 
+  public static String toJson(SortOrder sortOrder) {
+    return toJson(sortOrder, false);
+  }
+
+  public static String toJson(SortOrder sortOrder, boolean pretty) {
+    try {
+      StringWriter writer = new StringWriter();
+      JsonGenerator generator = JsonUtil.factory().createGenerator(writer);
+      if (pretty) {
+        generator.useDefaultPrettyPrinter();
+      }
+      toJson(sortOrder, generator);
+      generator.flush();
+      return writer.toString();
+
+    } catch (IOException e) {
+      throw new RuntimeIOException(e);

Review comment:
       nit: is there any reason for us to still use this deprecated exception instead of just `UncheckedIOException`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-812807604


   Since I redesigned the original approach, we may consider making serialized table classes non-public and offering a factory or a util class to build them instead. We will need to construct serialized tables manually for Kryo.
   
   ```
   SerializedTableFactory.newSerializedTable()
   ```
   
   I don't want to call it a util as it has to be in `org.apache.iceberg` package.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606970032



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       We have two table implementations: one for base and one for metadata tables. I don't think making those two classes public is a good idea. We need to do a switch somewhere and expose a method for building serializable tables.
   
   While we could name the class like `SerializableTable` and do a switch there, it may be a bit weird that it would not implement `Table`. Let me think more about this. Maybe, that's fine.
   
   ```
   SerializableTable.copyOf(table)
   SerializableTables.copyOf(table)
   SerializableTableFactory.copyOf(table)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606462269



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.

Review comment:
       Added the old comment back.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606970032



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       We have two table implementations: one for base and one for metadata tables. I don't think making those two classes public is a good idea. We need to do a switch somewhere and expose a method for building serializable tables.
   
   While we could name the class like `SerializableTable` and do a switch there, it may be a bit weird that it would not implement `Table`. Let me think more about this. Maybe, we can make the metadata serializable table as a nested class.
   
   ```
   SerializableTable.copyOf(table)
   SerializableTables.copyOf(table)
   SerializableTableFactory.copyOf(table)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-812970910


   I was not entirely happy with the implementation. I've decided to introduce a factory that should be used by query engines and hide the table implementations. Will need another review round. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606418816



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/SerializedConfiguration.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import java.io.Serializable;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+/**
+ * A class that captures the current state of a Hadoop configuration in a serializable manner.
+ * <p>
+ * <em>Note:</em> Unlike {@link SerializableConfiguration}, no subsequent changes to the underlying
+ * Hadoop configuration will be propagated to the captured state once an instance
+ * of {@link SerializedConfiguration} is constructed.
+ */
+public class SerializedConfiguration implements Serializable {

Review comment:
       Also, this class is Kryo compatible. So we don't have to wrap it into special classes like we do today in Spark.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606444830



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       Ah I see, the FIleIO has to directly accept the serialized hadoop config supplier. I thought it's enough after #2333, but the `setConf()` there still needs to accept a Hadoop config and would not work with the supplier.
   
   In that case, I think we can potentially have something like a `SerializedConfigurable` interface that has method `setSerializedConfSupplier(supplier)`, and check instance of that class. `HadoopFileIO` can then implement that class instead.
   
   The main point here is that if we do it for HadoopFileIO only, I think we should at least make sure other FileIOs that leverage hadoop configuration can work with this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606418039



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       We can do something similar to dynamic loading, use `if (table.io() instanceof Configurable)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606417163



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/SerializedConfiguration.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import java.io.Serializable;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+/**
+ * A class that captures the current state of a Hadoop configuration in a serializable manner.
+ * <p>
+ * <em>Note:</em> Unlike {@link SerializableConfiguration}, no subsequent changes to the underlying
+ * Hadoop configuration will be propagated to the captured state once an instance
+ * of {@link SerializedConfiguration} is constructed.
+ */
+public class SerializedConfiguration implements Serializable {

Review comment:
       We cannot just replace the internal implementation of the existing `SerializableConfiguration` class as Hadoop conf is not immutable and can dynamically change. For example, getting `FileSystem` would trigger a full reload of configs. That's why I added `SerializedConfiguration` that is used only in `SerializedTable` for now. In the future, we can use `SerializedConfiguration` in `writeReplace` of `SerializableConfiguration`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r607206904



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       I've updated to `SerializableTable.copyOf(table)` and made the metadata table a private nested class.
   Could you take one more look, @stevenzwu?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #2403: Core: Add SerializableTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605810146



##########
File path: core/src/main/java/org/apache/iceberg/BaseTransaction.java
##########
@@ -679,6 +680,10 @@ public LocationProvider locationProvider() {
     public String toString() {
       return name();
     }
+
+    Object writeReplace() {

Review comment:
       https://docs.oracle.com/javase/8/docs/api/java/io/Serializable.html
   I am learning things.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606625833



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * This implementation assumes the passed instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} are serializable. If you are serializing the table using a custom
+ * serialization framework like Kryo, those instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} must be supported by that particular serialization framework.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {

Review comment:
       nit: is `SerializableTable` a more accurate name?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-811579419


   @openinx @pvary @yyanyy @rdblue @RussellSpitzer @jackye1995, could you take a look, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605809313



##########
File path: core/src/main/java/org/apache/iceberg/BaseTransaction.java
##########
@@ -679,6 +680,10 @@ public LocationProvider locationProvider() {
     public String toString() {
       return name();
     }
+
+    Object writeReplace() {

Review comment:
       ```Serializable classes that need to designate an alternative object to be used when writing an object to the stream should implement this special method with the exact signature:
   
    ANY-ACCESS-MODIFIER Object writeReplace() throws ObjectStreamException;
    
   This writeReplace method is invoked by serialization if the method exists and it would be accessible from a method defined within the class of the object being serialized. Thus, the method can have private, protected and package-private access. Subclass access to this method follows java accessibility rules.
   
   Classes that need to designate a replacement when an instance of it is read from the stream should implement this special method with the exact signature.
   
    ANY-ACCESS-MODIFIER Object readResolve() throws ObjectStreamException;```
    ? Should we throw objectStream Exception?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606435257



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.

Review comment:
       Because `FileIO` and `LocationProvider` can all be dynamically loaded, and user can also potentially have a custom `EncryptionManager`, we should still mention in documentation that the end user needs to ensure Kryo compatibility in case they use custom implementations.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606536996



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/SerializedConfiguration.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import java.io.Serializable;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+/**
+ * A class that captures the current state of a Hadoop configuration in a serializable manner.
+ * <p>
+ * <em>Note:</em> Unlike {@link SerializableConfiguration}, no subsequent changes to the underlying
+ * Hadoop configuration will be propagated to the captured state once an instance
+ * of {@link SerializedConfiguration} is constructed.
+ */
+public class SerializedConfiguration implements Serializable {
+
+  private final Map<String, String> confAsMap;
+  private transient volatile Configuration conf = null;
+
+  public SerializedConfiguration(Configuration conf) {
+    this.confAsMap = Maps.newHashMapWithExpectedSize(conf.size());
+    conf.forEach(entry -> confAsMap.put(entry.getKey(), entry.getValue()));

Review comment:
       I think this should also set `conf` to the one that is passed in. That way this won't create a new configuration if `get` is called before the object is serialized.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606469713



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       Cool, let's discuss in another PR then. Just to throw another idea I just have:
   
   ```java
   public interface KryoSerializable<T> extends Serializable {
     
     // return a serialized version of self
     default T serialized() {
       throw new UnsupportedOperationException("Cannot support kryo serialization");
     }
   }
   ```
   
   ```java
   public interface FileIO extends KryoSerializable<FileIO> {
    ...
   }
   ```
   
   ```java
   public class HadoopFileIO ... {
     @Override
     public FileIO serialized() {
       SerializedConfiguration serializedConf = new SerializedConfiguration(getConf());
       return new HadoopFileIO(serializedConf::get);
     }
   }
   ```
   
   By doing this, we enforce everything dynamically loaded to be kryo serializable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605808192



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this(table, table.io());
+  }
+
+  public SerializedTable(Table table, FileIO io) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = io;
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;

Review comment:
       Tables that don't implement `HasTableOperations` will not be able to load all metadata but would still work.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r607206504



##########
File path: spark/src/test/java/org/apache/iceberg/KryoHelpers.java
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import org.apache.spark.SparkConf;
+import org.apache.spark.serializer.KryoSerializer;
+
+public class KryoHelpers {

Review comment:
       got it. But those tests don't seem tied to Spark Engine. Ideally, they probably should live in `iceberg-core` module. It is probably fine to add `Kryo` to test compile dep in the `iceberg-core` module?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606916340



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       this doesn't look like a traditional factory class. We probably don't need this wrapper and directly make the `SerializableTable` a public class.
   
   Then we can either have a constructor like `SerializableTable(Table t)` or have a static `public staticSerializableTable#from(Table t)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606420308



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       Yeah, we would need a way to construct a copy of the class with the serialized conf.
   
   If we knew that all `FileIO` implementations can be dynamically loaded, we could simply persist the class, props, and optional conf and rebuild it on demand. However, we cannot guarantee all `FileIO` implementations can be dynamically loaded. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605808192



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this(table, table.io());
+  }
+
+  public SerializedTable(Table table, FileIO io) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = io;
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;

Review comment:
       Tables that don't implement `HasTableOperations` will not be able to load full metadata but would still be serialized and deserialized correctly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606458455



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       Yeah, we would definitely need a more generic solution for `FileIO` implementations that depend on Hadoop conf. 
   I tried to prototype that but it raised a number of questions. The last idea I had was something like this:
   
   ```
   interface ConfigurableFileIO<T extends ConfigurableFileIO> extends FileIO, Configurable {
     T toSerializable(SerializedConfiguration serializedConf);
   
     default Object writeReplace() {
       // default the implementation to call `toSerializable`
     }
   }
   ```
   
   It was clear this requires a separate PR and more thinking. That's why I propose to address that in a follow-up. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606416112



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       This can be made more generic. We can expose `ConfigurableFileIO` interface or something similar instead of only handling `HadoopFileIO`.

##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       This can be made more generic in the future. We can expose `ConfigurableFileIO` interface or something similar instead of only handling `HadoopFileIO`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-812713312


   @RussellSpitzer @jackye1995, could you take one more look? I've updated the approach to make `SerializableTable` Kryo compatible. 
   
   cc @openinx @pvary @yyanyy @rdblue too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606625833



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * This implementation assumes the passed instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} are serializable. If you are serializing the table using a custom
+ * serialization framework like Kryo, those instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} must be supported by that particular serialization framework.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {

Review comment:
       nit: is `SerializableTable` a more accurate name? `Serialized` gives the impression of byte[] like serialized format.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606916340



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       this doesn't look like a traditional factory class. We probably don't need this wrapper and directly make the `SerializableTable` a public class.
   
   Then we can either have a constructor like `SerializableTable(Table t)` or have a static `public staticSerializableTable#copyOf(Table t)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606970032



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       We have two table implementations: one for base and one for metadata tables. I don't think making those two classes public is a good idea. We need to do a switch somewhere and expose a method for building serializable tables.
   
   While we could name the class like `SerializableTable` and do a switch there, it may be a bit weird that it does not implement `Table`. Let me think more about this. Maybe, that's fine.
   
   ```
   SerializableTable.copyOf(table)
   SerializableTables.copyOf(table)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606417163



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/SerializedConfiguration.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import java.io.Serializable;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+/**
+ * A class that captures the current state of a Hadoop configuration in a serializable manner.
+ * <p>
+ * <em>Note:</em> Unlike {@link SerializableConfiguration}, no subsequent changes to the underlying
+ * Hadoop configuration will be propagated to the captured state once an instance
+ * of {@link SerializedConfiguration} is constructed.
+ */
+public class SerializedConfiguration implements Serializable {

Review comment:
       We cannot just replace the internal implementation of the existing `SerializableConfiguration` class as Hadoop conf is not immutable and can dynamically change. For example, getting `FileSystem` would trigger a full reload of configs. That's why I added `SerializedConfiguration` that is used only in `SerializedTable` for now.
   
   Later, we can use `SerializedConfiguration` in `writeReplace` of `SerializableConfiguration`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606970497



##########
File path: spark/src/test/java/org/apache/iceberg/KryoHelpers.java
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import org.apache.spark.SparkConf;
+import org.apache.spark.serializer.KryoSerializer;
+
+public class KryoHelpers {

Review comment:
       We depend on Kryo from Spark. Iceberg does not bundle it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605775334



##########
File path: core/src/test/java/org/apache/iceberg/hadoop/TestTableSerialization.java
##########
@@ -140,4 +194,21 @@ private static Table getMetaDataTable(Table table, MetadataTableType type) {
       throw new RuntimeException("Could not read object ", e);
     }
   }
+
+  private void checkTable(Table expected, Table actual, boolean completeCheck) {

Review comment:
       I would just do two checkTableFunctions rather than the boolean behavior switch but that's just me :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605842687



##########
File path: core/src/main/java/org/apache/iceberg/BaseTransaction.java
##########
@@ -679,6 +680,10 @@ public LocationProvider locationProvider() {
     public String toString() {
       return name();
     }
+
+    Object writeReplace() {

Review comment:
       Yep yep.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605842145



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {

Review comment:
       Good point, added.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2403: Core: Add SerializableTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-813669731


   Thanks for reviewing, @stevenzwu @rdblue @RussellSpitzer @jackye1995!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r607209376



##########
File path: spark/src/test/java/org/apache/iceberg/KryoHelpers.java
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import org.apache.spark.SparkConf;
+import org.apache.spark.serializer.KryoSerializer;
+
+public class KryoHelpers {

Review comment:
       I am afraid each query engine has its own specifics. For example, Spark adds custom serializers for handling unmodifiable collections in Java while Flink does not, which led to exceptions on the Flink side.
   
   There is a number of Kryo related suites in query engine modules. I'd up for refactoring but probably in a separate PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606541996



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/SerializedConfiguration.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import java.io.Serializable;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+/**
+ * A class that captures the current state of a Hadoop configuration in a serializable manner.
+ * <p>
+ * <em>Note:</em> Unlike {@link SerializableConfiguration}, no subsequent changes to the underlying
+ * Hadoop configuration will be propagated to the captured state once an instance
+ * of {@link SerializedConfiguration} is constructed.
+ */
+public class SerializedConfiguration implements Serializable {
+
+  private final Map<String, String> confAsMap;
+  private transient volatile Configuration conf = null;
+
+  public SerializedConfiguration(Configuration conf) {
+    this.confAsMap = Maps.newHashMapWithExpectedSize(conf.size());
+    conf.forEach(entry -> confAsMap.put(entry.getKey(), entry.getValue()));

Review comment:
       If we do this, there will be no guarantee the conf and map are in sync. This class should be used right before the serialization and should not be accessed before it is serialized.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605806857



##########
File path: core/src/main/java/org/apache/iceberg/BaseTransaction.java
##########
@@ -679,6 +680,10 @@ public LocationProvider locationProvider() {
     public String toString() {
       return name();
     }
+
+    Object writeReplace() {

Review comment:
       I am not sure but I thought Java serialization requires this signature.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-812713312


   @RussellSpitzer @jackye1995, could you take one more look? I've updated the approach to make `SerializedTable` Kryo compatible. 
   
   cc @openinx @pvary @yyanyy @rdblue too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606458455



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       Yeah, we would definitely need a more generic solution for `FileIO` implementations that depend on Hadoop conf. 
   
   I tried to prototype that but it raised a number of questions. The last idea I had was something like this:
   
   ```
   interface ConfigurableFileIO<T extends ConfigurableFileIO> extends FileIO, Configurable {
     T toSerializable(SerializedConfiguration serializedConf);
   
     default Object writeReplace() {
       // default the implementation to call `toSerializable`
     }
   }
   ```
   
   It was clear this requires a separate PR and more thinking. That's why I propose to address that in a follow-up. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605842258



##########
File path: core/src/test/java/org/apache/iceberg/hadoop/TestTableSerialization.java
##########
@@ -140,4 +194,21 @@ private static Table getMetaDataTable(Table table, MetadataTableType type) {
       throw new RuntimeException("Could not read object ", e);
     }
   }
+
+  private void checkTable(Table expected, Table actual, boolean completeCheck) {

Review comment:
       Fixed.

##########
File path: core/src/main/java/org/apache/iceberg/SortOrderParser.java
##########
@@ -50,6 +52,26 @@ public static void toJson(SortOrder sortOrder, JsonGenerator generator) throws I
     generator.writeEndObject();
   }
 
+  public static String toJson(SortOrder sortOrder) {
+    return toJson(sortOrder, false);
+  }
+
+  public static String toJson(SortOrder sortOrder, boolean pretty) {
+    try {
+      StringWriter writer = new StringWriter();
+      JsonGenerator generator = JsonUtil.factory().createGenerator(writer);
+      if (pretty) {
+        generator.useDefaultPrettyPrinter();
+      }
+      toJson(sortOrder, generator);
+      generator.flush();
+      return writer.toString();
+
+    } catch (IOException e) {
+      throw new RuntimeIOException(e);

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606415750



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       I've changed this place to handle `FileIO` instead of delegating it to the caller.
   We rely on `SerializedConfiguration` that is introduced in this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605802102



##########
File path: core/src/main/java/org/apache/iceberg/SortOrderParser.java
##########
@@ -50,6 +52,26 @@ public static void toJson(SortOrder sortOrder, JsonGenerator generator) throws I
     generator.writeEndObject();
   }
 
+  public static String toJson(SortOrder sortOrder) {
+    return toJson(sortOrder, false);
+  }
+
+  public static String toJson(SortOrder sortOrder, boolean pretty) {
+    try {
+      StringWriter writer = new StringWriter();
+      JsonGenerator generator = JsonUtil.factory().createGenerator(writer);
+      if (pretty) {
+        generator.useDefaultPrettyPrinter();
+      }
+      toJson(sortOrder, generator);
+      generator.flush();
+      return writer.toString();
+
+    } catch (IOException e) {
+      throw new RuntimeIOException(e);

Review comment:
       That was copy & paste error. I'll update it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605801915



##########
File path: core/src/main/java/org/apache/iceberg/BaseTransaction.java
##########
@@ -679,6 +680,10 @@ public LocationProvider locationProvider() {
     public String toString() {
       return name();
     }
+
+    Object writeReplace() {

Review comment:
       It is not clear to me from the name of this method what it does, I assume the return type is Object so we don't have to parameterize on the subtype?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#issuecomment-812807604


   Since I redesigned the original approach, we may consider making serialized table classes non-public and offering a factory or a util class to build them instead. We will need to construct serialized tables manually for Kryo.
   
   ```
   SerializedTableFactory.newSerializedTable(table)
   ```
   
   I don't want to call it a util as it has to be in `org.apache.iceberg` package.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606739444



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * This implementation assumes the passed instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} are serializable. If you are serializing the table using a custom
+ * serialization framework like Kryo, those instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} must be supported by that particular serialization framework.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {

Review comment:
       Well, yes and no. It is not really serialized like you pointed out but I wanted to somehow convey its limitations and that it should be constructed right before the serialization. After thinking more, I think I agree with your reasoning. Let me update it. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606570164



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/SerializedConfiguration.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import java.io.Serializable;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+/**
+ * A class that captures the current state of a Hadoop configuration in a serializable manner.
+ * <p>
+ * <em>Note:</em> Unlike {@link SerializableConfiguration}, no subsequent changes to the underlying
+ * Hadoop configuration will be propagated to the captured state once an instance
+ * of {@link SerializedConfiguration} is constructed.
+ */
+public class SerializedConfiguration implements Serializable {
+
+  private final Map<String, String> confAsMap;
+  private transient volatile Configuration conf = null;
+
+  public SerializedConfiguration(Configuration conf) {
+    this.confAsMap = Maps.newHashMapWithExpectedSize(conf.size());
+    conf.forEach(entry -> confAsMap.put(entry.getKey(), entry.getValue()));

Review comment:
       I moved this class to `SerializedTable` to hide it and make sure nobody uses it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605808214



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {

Review comment:
       Probably should add this comment to the Javadoc




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605802414



##########
File path: core/src/test/java/org/apache/iceberg/hadoop/TestTableSerialization.java
##########
@@ -140,4 +194,21 @@ private static Table getMetaDataTable(Table table, MetadataTableType type) {
       throw new RuntimeException("Could not read object ", e);
     }
   }
+
+  private void checkTable(Table expected, Table actual, boolean completeCheck) {

Review comment:
       The boolean flag isn't descriptive. I'll switch to two methods instead. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606420308



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       Yeah, we would need a way to construct a copy of the class with the serialized conf.
   
   If we knew that all `FileIO` implementations can be dynamically loaded, we could simply persist the class, props, and optional conf and rebuild it on demand. However, we cannot guarantee all `FileIO` implementations are dynamically loaded. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606746022



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {
+
+  private SerializableTableFactory() {
+  }
+
+  /**
+   * Creates a read-only serializable table that can be sent to other nodes in a cluster.
+   * <p>
+   * The created table will copy the original table state in a serializable manner and will not
+   * reflect any subsequent changes to the original table.
+   * <p>
+   * While this class captures the metadata file location that can be used to load the complete
+   * table metadata, it directly persists the current schema, spec, sort order, table properties
+   * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+   * <p>
+   * This implementation assumes the passed instances of {@link FileIO}, {@link EncryptionManager},
+   * {@link LocationProvider} are serializable. If you are serializing the table using a custom
+   * serialization framework like Kryo, those instances of {@link FileIO}, {@link EncryptionManager},
+   * {@link LocationProvider} must be supported by that particular serialization framework.
+   * <p>
+   * <em>Note:</em> loading the complete metadata from a large number of nodes can overwhelm the storage.
+   *
+   * @param table the original table to copy the state from
+   * @return a read-only serializable table reflecting the current state of the original table
+   */
+  public static Table copyOf(Table table) {

Review comment:
       Query engines that need Kryo support will call this method manually.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606535048



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {
+  private final String name;
+  private final String location;
+  private final String metadataFileLocation;
+  private final Map<String, String> properties;
+  private final String schemaAsJson;
+  private final String specAsJson;
+  private final String sortOrderAsJson;
+  private final FileIO io;
+  private final EncryptionManager encryption;
+  private final LocationProvider locationProvider;
+
+  private transient volatile Table lazyTable = null;
+  private transient volatile Schema lazySchema = null;
+  private transient volatile PartitionSpec lazySpec = null;
+  private transient volatile SortOrder lazySortOrder = null;
+
+  public SerializedTable(Table table) {
+    this.name = table.name();
+    this.location = table.location();
+    this.metadataFileLocation = metadataFileLocation(table);
+    this.properties = SerializableMap.copyOf(table.properties());
+    this.schemaAsJson = SchemaParser.toJson(table.schema());
+    this.specAsJson = PartitionSpecParser.toJson(table.spec());
+    this.sortOrderAsJson = SortOrderParser.toJson(table.sortOrder());
+    this.io = fileIO(table);
+    this.encryption = table.encryption();
+    this.locationProvider = table.locationProvider();
+  }
+
+  private String metadataFileLocation(Table table) {
+    if (table instanceof HasTableOperations) {
+      TableOperations ops = ((HasTableOperations) table).operations();
+      return ops.current().metadataFileLocation();
+    } else {
+      return null;
+    }
+  }
+
+  private FileIO fileIO(Table table) {

Review comment:
       I don't think that we should change the contract for dynamically loaded classes. If a user chooses to use Kryo and a dynamically loaded component, it is their responsibility to make sure the two are compatible. We just need to make sure that Iceberg-supplied classes work with Kryo.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r605325627



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,304 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {

Review comment:
       This is still not Kryo friendly due to `io`. Query engines will have to handle that separately.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606438507



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,312 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.hadoop.SerializedConfiguration;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.

Review comment:
       Yeah, let me adapt the old message slightly and add it back.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606746084



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       I went with a factory instead of a util class so that I can have it in `org.apache.iceberg` and don't need to open up our base metadata table class. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606417486



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/SerializedConfiguration.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import java.io.Serializable;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+
+/**
+ * A class that captures the current state of a Hadoop configuration in a serializable manner.
+ * <p>
+ * <em>Note:</em> Unlike {@link SerializableConfiguration}, no subsequent changes to the underlying
+ * Hadoop configuration will be propagated to the captured state once an instance
+ * of {@link SerializedConfiguration} is constructed.
+ */
+public class SerializedConfiguration implements Serializable {

Review comment:
       This way reduced the size of the serialized Hadoop conf by 50% as we don't serialize the source of the conf.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2403: Core: Add SerializedTable and SerializedMetadataTable

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606746251



##########
File path: core/src/main/java/org/apache/iceberg/SerializedTable.java
##########
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A read-only serializable table implementation that can be sent to other nodes.
+ * <p>
+ * While this class captures the metadata file location that can be used to load the full table
+ * metadata, it also serializes directly things like properties, schema, spec, sort order
+ * to avoid reading the metadata file from other nodes to access frequently needed metadata.
+ * <p>
+ * This implementation assumes the passed instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} are serializable. If you are serializing the table using a custom
+ * serialization framework like Kryo, those instances of {@link FileIO}, {@link EncryptionManager},
+ * {@link LocationProvider} must be supported by that particular serialization framework.
+ * <p>
+ * <em>Note:</em> loading the full metadata from a large number of nodes can overwhelm the storage.
+ */
+public class SerializedTable implements Table, Serializable {

Review comment:
       Changed. Resolving this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r606914243



##########
File path: spark/src/test/java/org/apache/iceberg/KryoHelpers.java
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import org.apache.spark.SparkConf;
+import org.apache.spark.serializer.KryoSerializer;
+
+public class KryoHelpers {

Review comment:
       why are this and other test files in spark module?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2403: Core: Add SerializableTableFactory

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2403:
URL: https://github.com/apache/iceberg/pull/2403#discussion_r607205561



##########
File path: core/src/main/java/org/apache/iceberg/SerializableTableFactory.java
##########
@@ -0,0 +1,383 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.encryption.EncryptionManager;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.LocationProvider;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.SerializableMap;
+
+/**
+ * A factory to create serializable tables.
+ */
+public class SerializableTableFactory {

Review comment:
       `SerializableTables.copyOf(table)` looks more accurate than `Factory` to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org