You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/09 03:25:59 UTC

[GitHub] [iceberg] JonasJ-ap opened a new pull request, #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory

JonasJ-ap opened a new pull request, #5939:
URL: https://github.com/apache/iceberg/pull/5939

   Fix the `NotSerializableException` when using `AssumeRoleAwsClientFactory` to configure the `GlueCatalog` of a spark shell.
   
   Compiled iceberg-spark-runtime-3.1 and tested on Glue 3.0.
   The following Spark script triggers the exception before the fix and succeeds after the fix:
   ```
   import org.apache.spark.SparkContext
   import org.apache.spark.sql.SparkSession
   
   import org.apache.iceberg.Table
   import org.apache.iceberg.aws.glue.GlueCatalog
   import org.apache.iceberg.catalog.Catalog
   import org.apache.iceberg.aws.AssumeRoleAwsClientFactory
   import org.apache.iceberg.catalog.TableIdentifier
   import org.apache.iceberg.spark.actions.SparkActions
   
   import scala.jdk.CollectionConverters._
   
   object GlueApp {
       def main(sysArgs: Array[String]) {
           val sparkContext: SparkContext = new SparkContext()
           val spark: SparkSession = SparkSession.builder.
             config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog").
             config("spark.sql.catalog.demo.warehouse", "s3://gluetestjonas/warehouse").
             config("spark.sql.catalog.demo.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog").
             config("spark.sql.catalog.demo.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory").
             config("spark.sql.catalog.demo.client.assume-role.arn", "arn:aws:iam::481640105715:role/jonasjiang_gluejob2").
             config("spark.sql.catalog.demo.client.assume-role.region", "us-east-1").
             config("spark.sql.catalog.demo.client.assume-role.session-name", "mytestname").
             getOrCreate()
             
           spark.sql("CREATE DATABASE IF NOT EXISTS demo.reviewsjonas")
           
           val book_reviews_location = "s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet"
               val book_reviews = spark.read.parquet(book_reviews_location)
               book_reviews.writeTo("demo.reviewsjonas.book_reviews_session_name").
                 tableProperty("format-version", "2").
                 createOrReplace()
               
               // read using SQL
               // spark.sql("SELECT * FROM demo.reviews.book_reviews").show()
       }
       
    
   }
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
JonasJ-ap commented on code in PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#discussion_r991864782


##########
aws/src/test/java/org/apache/iceberg/aws/TestAwsClientFactories.java:
##########
@@ -71,6 +73,53 @@ public void testS3FileIoCredentialsVerification() {
         () -> AwsClientFactories.from(properties));
   }
 
+  @Test
+  public void testDefaultAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    AwsClientFactory defaultAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(defaultAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "DefaultAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AwsClientFactories.DefaultAwsClientFactory);
+  }
+
+  @Test
+  public void testAssumeRoleAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put(AwsProperties.CLIENT_FACTORY, AssumeRoleAwsClientFactory.class.getName());
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_ARN, "arn::test");
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, "us-east-1");
+    AwsClientFactory assumeRoleAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(assumeRoleAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "AssumeRoleAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AssumeRoleAwsClientFactory);
+  }
+
+  @Test
+  public void testLakeFormationAwsClientFactorySerializable() {

Review Comment:
   
   > Can you plz add the complete stacktrace you see in Spark Serialization failure with java serializer
   
   Here is the stacktrace that I copied from AWS Glue Job's Error Log:
   ```
   2022-10-09 01:56:21,890 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
   java.io.NotSerializableException: software.amazon.awssdk.services.sts.model.AssumeRoleRequest
   Serialization stack:
   	- object not serializable (class: software.amazon.awssdk.services.sts.model.AssumeRoleRequest, value: AssumeRoleRequest(RoleArn=arn:aws:iam::481640105715:role/jonasjiang_gluejob2, RoleSessionName=iceberg-aws-d9c0fac0-e938-4867-b8e3-f7f160f11a82, DurationSeconds=3600, Tags=[]))
   	- field (class: org.apache.iceberg.aws.AssumeRoleAwsClientFactory, name: assumeRoleRequest, type: class software.amazon.awssdk.services.sts.model.AssumeRoleRequest)
   	- object (class org.apache.iceberg.aws.AssumeRoleAwsClientFactory, org.apache.iceberg.aws.AssumeRoleAwsClientFactory@5a07088b)
   	- element of array (index: 0)
   	- array (class [Ljava.lang.Object;, size 1)
   	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
   	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.iceberg.aws.s3.S3FileIO, functionalInterfaceMethod=org/apache/iceberg/util/SerializableSupplier.get:()Ljava/lang/Object;, implementation=invokeInterface org/apache/iceberg/aws/AwsClientFactory.s3:()Lsoftware/amazon/awssdk/services/s3/S3Client;, instantiatedMethodType=()Lsoftware/amazon/awssdk/services/s3/S3Client;, numCaptured=1])
   	- writeReplace data (class: java.lang.invoke.SerializedLambda)
   	- object (class org.apache.iceberg.aws.s3.S3FileIO$$Lambda$1017/1010480754, org.apache.iceberg.aws.s3.S3FileIO$$Lambda$1017/1010480754@11885841)
   	- field (class: org.apache.iceberg.aws.s3.S3FileIO, name: s3, type: interface org.apache.iceberg.util.SerializableSupplier)
   	- object (class org.apache.iceberg.aws.s3.S3FileIO, org.apache.iceberg.aws.s3.S3FileIO@64236be)
   	- field (class: org.apache.iceberg.SerializableTable, name: io, type: interface org.apache.iceberg.io.FileIO)
   	- object (class org.apache.iceberg.spark.source.SerializableTableWithSize, org.apache.iceberg.spark.source.SerializableTableWithSize@28d07fa4)
   	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
   	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
   	at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:319)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:321)
   	at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:138)
   	at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:91)
   	at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
   	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
   	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1545)
   	at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:546)
   	at org.apache.iceberg.spark.source.SparkWrite.createWriterFactory(SparkWrite.java:164)
   	at org.apache.iceberg.spark.source.SparkWrite.access$700(SparkWrite.java:92)
   	at org.apache.iceberg.spark.source.SparkWrite$BaseBatchWrite.createBatchWriterFactory(SparkWrite.java:241)
   	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:348)
   	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:336)
   	at org.apache.spark.sql.execution.datasources.v2.AtomicReplaceTableAsSelectExec.writeWithV2(WriteToDataSourceV2Exec.scala:178)
   	at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.$anonfun$writeToTable$1(WriteToDataSourceV2Exec.scala:476)
   	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
   	at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.writeToTable(WriteToDataSourceV2Exec.scala:465)
   	at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.writeToTable$(WriteToDataSourceV2Exec.scala:460)
   	at org.apache.spark.sql.execution.datasources.v2.AtomicReplaceTableAsSelectExec.writeToTable(WriteToDataSourceV2Exec.scala:178)
   	at org.apache.spark.sql.execution.datasources.v2.AtomicReplaceTableAsSelectExec.run(WriteToDataSourceV2Exec.scala:209)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.doExecute(V2CommandExec.scala:55)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
   	at org.apache.spark.sql.DataFrameWriterV2.$anonfun$runCommand$1(DataFrameWriterV2.scala:196)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.DataFrameWriterV2.runCommand(DataFrameWriterV2.scala:196)
   	at org.apache.spark.sql.DataFrameWriterV2.internalReplace(DataFrameWriterV2.scala:213)
   	at org.apache.spark.sql.DataFrameWriterV2.createOrReplace(DataFrameWriterV2.scala:133)
   	at GlueApp$.main(assumeroletest2.scala:30)
   	at GlueApp.main(assumeroletest2.scala)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
   	at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
   	at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
   	at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
   	at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
   	at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
   
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 merged pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
jackye1995 merged PR #5939:
URL: https://github.com/apache/iceberg/pull/5939


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
JonasJ-ap commented on PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#issuecomment-1272677978

   > Can we add a test to do a round trip serialization check to prevent this from happening again? You can use `SerializationUtil`
   
   Sure. Based on my understanding, all client factories should be serializable. Hence, I added three unit test to check the serializability of `DefaultAwsClientFactory`, `AssumeRoleAwsClientFactory`, and `LakeFormationAwsClientFactory`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#issuecomment-1274917731

   Thanks for the fix @JonasJ-ap and thanks @singhpk234 for the review. Since this is a bug might impact quite a lot of users, I will go ahead to merge it instead of waiting for more reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
JonasJ-ap commented on code in PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#discussion_r991570121


##########
aws/src/test/java/org/apache/iceberg/aws/TestAwsClientFactories.java:
##########
@@ -71,6 +73,53 @@ public void testS3FileIoCredentialsVerification() {
         () -> AwsClientFactories.from(properties));
   }
 
+  @Test
+  public void testDefaultAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    AwsClientFactory defaultAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(defaultAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "DefaultAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AwsClientFactories.DefaultAwsClientFactory);
+  }
+
+  @Test
+  public void testAssumeRoleAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put(AwsProperties.CLIENT_FACTORY, AssumeRoleAwsClientFactory.class.getName());
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_ARN, "arn::test");
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, "us-east-1");
+    AwsClientFactory assumeRoleAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(assumeRoleAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "AssumeRoleAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AssumeRoleAwsClientFactory);
+  }
+
+  @Test
+  public void testLakeFormationAwsClientFactorySerializable() {

Review Comment:
   Thank you for your suggestions. I added the kryo round trip check for the three AwsClientFactories.
   
   After I added the Kryo check, I intentionally create a `AssumeRoleRequest` as an instance variable and initialize it in `initialize` method just like the implementation before the fix. I found that `NotSerializableException` will only be raised by the java serializer (The kryo serializer did not raise any exception and successfully passed the all the assertion). I wonder if this is a expected behavior given that we use a different serializer or there is something wrong in the way I build the test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#issuecomment-1272454957

   Thanks for the fix, I also overlooked the fact that the request is not serializable. Technically we can still store the assume role request, just need to not make it as `transient volatile` and initialize it if null. But it feels like an overkill given we probably need the request every 6 hours or so and it's okay to create it at runtime. So I agree this is the right way to fix. Let me know when this is out of draft status and I will review again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a diff in pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on code in PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#discussion_r991828441


##########
aws/src/test/java/org/apache/iceberg/aws/TestAwsClientFactories.java:
##########
@@ -71,6 +73,53 @@ public void testS3FileIoCredentialsVerification() {
         () -> AwsClientFactories.from(properties));
   }
 
+  @Test
+  public void testDefaultAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    AwsClientFactory defaultAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(defaultAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "DefaultAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AwsClientFactories.DefaultAwsClientFactory);
+  }
+
+  @Test
+  public void testAssumeRoleAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put(AwsProperties.CLIENT_FACTORY, AssumeRoleAwsClientFactory.class.getName());
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_ARN, "arn::test");
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, "us-east-1");
+    AwsClientFactory assumeRoleAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(assumeRoleAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "AssumeRoleAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AssumeRoleAwsClientFactory);
+  }
+
+  @Test
+  public void testLakeFormationAwsClientFactorySerializable() {

Review Comment:
   Thank you for adding the test !
   
   > I wonder if this is a expected behavior given that we use a different serializer or there is something wrong in the way I build the test.
   
   A bit strange though, have mostly seen other way round, things that are serializable by Java serializer are not kryo serializable, I also tried it locally and AssumeRoleRequest is indeed kryo serializable.
   
   Can you plz add the complete stacktrace you see in Spark Serialization failure with java serializer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on code in PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#discussion_r992507927


##########
aws/src/test/java/org/apache/iceberg/aws/TestAwsClientFactories.java:
##########
@@ -71,6 +73,53 @@ public void testS3FileIoCredentialsVerification() {
         () -> AwsClientFactories.from(properties));
   }
 
+  @Test
+  public void testDefaultAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    AwsClientFactory defaultAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(defaultAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "DefaultAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AwsClientFactories.DefaultAwsClientFactory);
+  }
+
+  @Test
+  public void testAssumeRoleAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put(AwsProperties.CLIENT_FACTORY, AssumeRoleAwsClientFactory.class.getName());
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_ARN, "arn::test");
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, "us-east-1");
+    AwsClientFactory assumeRoleAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(assumeRoleAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "AssumeRoleAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AssumeRoleAwsClientFactory);
+  }
+
+  @Test
+  public void testLakeFormationAwsClientFactorySerializable() {

Review Comment:
   Turns out to be true that you don't need to implement `Serializable` to be Kyro serializable: https://stackoverflow.com/questions/38503563/does-kryo-serialization-work-on-non-serializable-class-and-class-have-non-serial
   
   And I checked the generated java class `AssumeRoleRequest` it does not implement `Serializable`. I was always under the assumption that it does since it's generated class, lesson learned orz



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a diff in pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on code in PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#discussion_r990949961


##########
aws/src/test/java/org/apache/iceberg/aws/TestAwsClientFactories.java:
##########
@@ -71,6 +73,53 @@ public void testS3FileIoCredentialsVerification() {
         () -> AwsClientFactories.from(properties));
   }
 
+  @Test
+  public void testDefaultAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    AwsClientFactory defaultAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(defaultAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "DefaultAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AwsClientFactories.DefaultAwsClientFactory);
+  }
+
+  @Test
+  public void testAssumeRoleAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put(AwsProperties.CLIENT_FACTORY, AssumeRoleAwsClientFactory.class.getName());
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_ARN, "arn::test");
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, "us-east-1");
+    AwsClientFactory assumeRoleAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(assumeRoleAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "AssumeRoleAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AssumeRoleAwsClientFactory);
+  }
+
+  @Test
+  public void testLakeFormationAwsClientFactorySerializable() {

Review Comment:
   should we also test this with kryoSerializer (spark users prefer it over java serializer) ? can use this util `TestHelpers.KryoHelpers.roundTripSerialize`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a diff in pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on code in PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#discussion_r992486634


##########
aws/src/test/java/org/apache/iceberg/aws/TestAwsClientFactories.java:
##########
@@ -71,6 +73,53 @@ public void testS3FileIoCredentialsVerification() {
         () -> AwsClientFactories.from(properties));
   }
 
+  @Test
+  public void testDefaultAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    AwsClientFactory defaultAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(defaultAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "DefaultAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AwsClientFactories.DefaultAwsClientFactory);
+  }
+
+  @Test
+  public void testAssumeRoleAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put(AwsProperties.CLIENT_FACTORY, AssumeRoleAwsClientFactory.class.getName());
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_ARN, "arn::test");
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, "us-east-1");
+    AwsClientFactory assumeRoleAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(assumeRoleAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "AssumeRoleAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AssumeRoleAwsClientFactory);
+  }
+
+  @Test
+  public void testLakeFormationAwsClientFactorySerializable() {

Review Comment:
   Thanks for sharing the stack trace
   
   > - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.iceberg.aws.s3.S3FileIO, functionalInterfaceMethod=org/apache/iceberg/util/SerializableSupplier.get:()Ljava/lang/Object;, implementation=invokeInterface org/apache/iceberg/aws/AwsClientFactory.s3:()Lsoftware/amazon/awssdk/services/s3/S3Client;, instantiatedMethodType=()Lsoftware/amazon/awssdk/services/s3/S3Client;, numCaptured=1])
   	- writeReplace data (class: java.lang.invoke.SerializedLambda)
   	- object (class org.apache.iceberg.aws.s3.S3FileIO$$Lambda$1017/1010480754, org.apache.iceberg.aws.s3.S3FileIO$$Lambda$1017/1010480754@11885841)
   	- field (class: org.apache.iceberg.aws.s3.S3FileIO, name: s3, type: interface org.apache.iceberg.util.SerializableSupplier)
   	- object (class org.apache.iceberg.aws.s3.S3FileIO, org.apache.iceberg.aws.s3.S3FileIO@64236be)
   
   As per above, S3FileIO becomes non-serializable as well, due to lambda holding reference to client-factory, we have UT's for S3FileIO serde with Default AWS factory but not with other client factories, but since we are Unit testing other client factories serde, we should be good here as well. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
JonasJ-ap commented on code in PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#discussion_r991570121


##########
aws/src/test/java/org/apache/iceberg/aws/TestAwsClientFactories.java:
##########
@@ -71,6 +73,53 @@ public void testS3FileIoCredentialsVerification() {
         () -> AwsClientFactories.from(properties));
   }
 
+  @Test
+  public void testDefaultAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    AwsClientFactory defaultAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(defaultAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "DefaultAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AwsClientFactories.DefaultAwsClientFactory);
+  }
+
+  @Test
+  public void testAssumeRoleAwsClientFactorySerializable() {
+    Map<String, String> properties = Maps.newHashMap();
+    properties.put(AwsProperties.CLIENT_FACTORY, AssumeRoleAwsClientFactory.class.getName());
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_ARN, "arn::test");
+    properties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, "us-east-1");
+    AwsClientFactory assumeRoleAwsClientFactory = AwsClientFactories.from(properties);
+    byte[] serializedFactoryBytes = SerializationUtil.serializeToBytes(assumeRoleAwsClientFactory);
+    AwsClientFactory deserializedClientFactory =
+        SerializationUtil.deserializeFromBytes(serializedFactoryBytes);
+    Assert.assertTrue(
+        "AssumeRoleAwsClientFactory should be serializable",
+        deserializedClientFactory instanceof AssumeRoleAwsClientFactory);
+  }
+
+  @Test
+  public void testLakeFormationAwsClientFactorySerializable() {

Review Comment:
   Thank you for your suggestions. I added the kryo round trip check for the three AwsClientFactories.
   
   After I added the Kryo check, I intentionally create a `AssumeRoleRequest` as an instance variable and initialize it in `initialize` method just like the implementation before the fix. I found that `NotSerializableException` would only be raised by the java serializer (The kryo serializer did not raise any exception and successfully passed the all the assertion). I wonder if this is a expected behavior given that we use a different serializer or there is something wrong in the way I build the test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
JonasJ-ap commented on PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#issuecomment-1272571569

   > Thanks for the fix, I also overlooked the fact that the request is not serializable. Technically we can still store the assume role request, just need to not make it as `transient volatile` and initialize it if null. But it feels like an overkill given we probably need the request every 6 hours or so and it's okay to create it at runtime. So I agree this is the right way to fix. Let me know when this is out of draft status and I will review again!
   
   @jackye1995 Thank you for your explanation. I marked this PR "Ready to be reviewed"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #5939: AWS: Fix NotSerializableException when using AssumeRoleAwsClientFactory in Spark

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on PR #5939:
URL: https://github.com/apache/iceberg/pull/5939#issuecomment-1272581963

   Can we add a test to do a round trip serialization check to prevent this from happening again? You can use `SerializationUtil`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org