Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/18 11:28:59 UTC

[GitHub] [iceberg] akshayar opened a new issue, #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

akshayar opened a new issue, #4804:
URL: https://github.com/apache/iceberg/issues/4804

   EMR Version: emr-6.5.0-latest
   Iceberg Version: 0.13.1
   https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark3-runtime/0.13.1/iceberg-spark3-runtime-0.13.1.jar
   https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark3-extensions/0.13.1/iceberg-spark3-extensions-0.13.1.jar
   
   I am trying to run an Iceberg streaming ingestion application that consumes from a Kinesis Data Stream and ingests data into S3. When I run it on EMR on EKS with EC2 nodes, it works. However, when I run it on EMR on EKS with Fargate, it fails with this error:
   
   Exception in thread "main" software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., WebIdentityTokenCredentialsProvider(): Multiple HTTP implementations were found on the classpath. To avoid non-deterministic loading implementations, please explicitly provide an HTTP client via the client builders, set the software.amazon.awssdk.http.service.impl system property with the FQCN of the HTTP service to use as the default, or remove all but one HTTP implementation from the classpath, ProfileCredentialsProvider(): Profile file contained no credentials for profile 'default': ProfileFile(profiles=[]), ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., InstanceProfileCredentialsProvider(): Unable to load credentials from service endpoint.]
   	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
   	at software.amazon.awssdk.auth.credentials.AwsCredentialsProviderChain.resolveCredentials(AwsCredentialsProviderChain.java:112)
   	at software.amazon.awssdk.auth.credentials.internal.LazyAwsCredentialsProvider.resolveCredentials(LazyAwsCredentialsProvider.java:45)
   	at software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider.resolveCredentials(DefaultCredentialsProvider.java:104)
   	at software.amazon.awssdk.awscore.client.handler.AwsClientHandlerUtils.createExecutionContext(AwsClientHandlerUtils.java:76)
   	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.createExecutionContext(AwsSyncClientHandler.java:68)
   	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:97)
   	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:167)
   	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:94)
   	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
   	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55)
   	at software.amazon.awssdk.services.glue.DefaultGlueClient.getTable(DefaultGlueClient.java:7220)
   	at org.apache.iceberg.aws.glue.GlueTableOperations.getGlueTable(GlueTableOperations.java:162)
   	at org.apache.iceberg.aws.glue.GlueTableOperations.doRefresh(GlueTableOperations.java:91)
   	at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:95)
   	at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:78)
   	at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:42)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2344)
   	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2342)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2325)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
   	at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
   	at org.apache.iceberg.CachingCatalog.loadTable(CachingCatalog.java:161)
   	at org.apache.iceberg.spark.SparkCatalog.load(SparkCatalog.java:488)
   	at org.apache.iceberg.spark.SparkCatalog.loadTable(SparkCatalog.java:135)
   	at org.apache.iceberg.spark.SparkCatalog.loadTable(SparkCatalog.java:92)
   	at org.apache.spark.sql.connector.catalog.TableCatalog.tableExists(TableCatalog.java:119)
   
   
   The job run details are:
   {
     "jobRun": {
       "id": "0000000306mnu3nsmd3",
       "name": "iceberg-job",
       "virtualClusterId": "bf8egc23bcgkw0ac9hitobvu0",
       "arn": "arn:aws:emr-containers:ap-south-1:ACCOUNT_ID:/virtualclusters/bf8egc23bcgkw0ac9hitobvu0/jobruns/0000000306mnu3nsmd3",
       "state": "FAILED",
       "clientToken": "73aa6347-cf3b-4a20-9898-97f108639b85",
       "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/emr-on-eks-job-role",
       "releaseLabel": "emr-6.5.0-latest",
       "configurationOverrides": {
         "applicationConfiguration": [
           {
             "classification": "spark-defaults",
             "properties": {
               "spark.kubernetes.driver.label.type": "etl",
               "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
               "spark.kubernetes.executor.label.type": "etl",
               "spark.sql.catalog.my_catalog.warehouse": "s3://s3-data-bucket/iceberg",
               "spark.sql.catalog.my_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
               "spark.sql.catalog.my_catalog": "org.apache.iceberg.spark.SparkCatalog",
               "spark.sql.catalog.my_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO"
             }
           }
         ],
         "monitoringConfiguration": {
           "persistentAppUI": "ENABLED",
           "cloudWatchMonitoringConfiguration": {
             "logGroupName": "/emr-on-eks/eksworkshop-eksctl",
             "logStreamNamePrefix": "iceberg-job"
           },
           "s3MonitoringConfiguration": {
             "logUri": "s3://s3-data-bucket/hudi/logs/"
           }
         }
       },
       "jobDriver": {
         "sparkSubmitJobDriver": {
           "entryPoint": "s3://s3-data-bucket/spark-structured-streaming-kinesis-iceberg_2.12-1.0.jar",
           "entryPointArguments": [
             "s3-data-bucket",
             "data-stream-ingest-json",
             "ap-south-1",
             "my_catalog.demoiceberg.eks_fargate_iceberg_kinesis",
             "LATEST"
           ],
           "sparkSubmitParameters": "--class kinesis.iceberg.latefile.SparkKinesisConsumerIcebergProcessor --jars https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark3-runtime/0.13.1/iceberg-spark3-runtime-0.13.1.jar,https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark3-extensions/0.13.1/iceberg-spark3-extensions-0.13.1.jar,https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kinesis-asl_2.12/3.1.1/spark-streaming-kinesis-asl_2.12-3.1.1.jar,s3://'s3-data-bucket'/spark-sql-kinesis_2.12-1.2.1_spark-3.0-SNAPSHOT.jar,https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.15.40/bundle-2.15.40.jar,https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.15.40/url-connection-client-2.15.40.jar"
         }
       },
       "createdAt": "2022-05-18T10:38:37+00:00",
       "createdBy": "arn:aws:iam::ACCOUNT_ID:user/username",
       "finishedAt": "2022-05-18T10:42:15+00:00",
       "stateDetails": "Jobrun failed. Main Spark container terminated with errors. Please refer logs uploaded to S3/CloudWatch based on your monitoring configuration.",
       "failureReason": "USER_ERROR",
       "tags": {}
     }
   }
   
   
   The Scala code writes using the following lines:
   import java.util.concurrent.TimeUnit
   import org.apache.spark.sql.streaming.Trigger

   // Write the Kinesis-sourced DataFrame to the Iceberg table once a minute.
   val query = (jsonDF.writeStream
         .format("iceberg")
   //      .format("console")
         .outputMode("append")
         .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
         .option("path", tableName)
         .option("fanout-enabled", "true")
         .option("checkpointLocation", checkpoint_path)
         .start())
   




[GitHub] [iceberg] akshayar commented on issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

akshayar commented on issue #4804:
URL: https://github.com/apache/iceberg/issues/4804#issuecomment-1140940114

   This doesn't seem to be an Iceberg issue. I get this when I run a Fargate task as well. Closing the issue.




[GitHub] [iceberg] rajarshisarkar commented on issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

rajarshisarkar commented on issue #4804:
URL: https://github.com/apache/iceberg/issues/4804#issuecomment-1131501318

   Seems like there is an issue loading the credentials. Is the AWS SDK v2 library on the classpath? The `--jars` path may be breaking just before the AWS SDK v2 library because of `'s3-data-bucket'`:
   
   ```
   "sparkSubmitParameters": "--jars .....,s3://'s3-data-bucket'/spark-sql-kinesis_2.12-1.2.1_spark-3.0-SNAPSHOT.jar,https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.15.40/bundle-2.15.40.jar,https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.15.40/url-connection-client-2.15.40.jar"
   ```
   
   If it is on the classpath, can you please try explicitly providing the access key, either via the environment variable (`AWS_ACCESS_KEY_ID`) or the system property (`aws.accessKeyId`), for testing purposes?
   
   Also, I couldn't find the DynamoDB lock manager Spark configuration: `--conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager`. Refer to [DynamoDB for Commit Locking](https://iceberg.apache.org/docs/latest/aws/#dynamodb-for-commit-locking).
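   
   As a minimal test-only sketch (assuming the AWS SDK v2 bundle jar is on the driver classpath; the object name and placeholder environment variables are hypothetical), one could check whether the same default chain that Iceberg's `GlueCatalog` uses can resolve credentials inside the container:
   
   ```
   import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider
   
   object CredentialChainCheck {
     def main(args: Array[String]): Unit = {
       // Test only: inject the key as system properties before anything touches
       // the SDK. Placeholder values; never hard-code real keys.
       System.setProperty("aws.accessKeyId", sys.env.getOrElse("TEST_ACCESS_KEY_ID", ""))
       System.setProperty("aws.secretAccessKey", sys.env.getOrElse("TEST_SECRET_ACCESS_KEY", ""))
   
       // Resolve credentials through the default provider chain (system properties,
       // env vars, web identity token, profile, container, instance profile).
       val creds = DefaultCredentialsProvider.create().resolveCredentials()
       println(s"Resolved access key id prefix: ${creds.accessKeyId().take(4)}****")
     }
   }
   ```
   
   If this resolves on the EC2-backed nodes but not on Fargate, it would suggest the problem is the pod's credential source rather than Iceberg itself.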




[GitHub] [iceberg] rajarshisarkar commented on issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

rajarshisarkar commented on issue #4804:
URL: https://github.com/apache/iceberg/issues/4804#issuecomment-1141096793

   Thanks for the confirmation, @akshayar.




[GitHub] [iceberg] rajarshisarkar commented on issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

rajarshisarkar commented on issue #4804:
URL: https://github.com/apache/iceberg/issues/4804#issuecomment-1135772459

   Can you please check the following:
   1. Is the AWS SDK v2 library on the classpath with no conflicts? If it is, can you please try explicitly providing the access key, either via the environment variable (`AWS_ACCESS_KEY_ID`) or the system property (`aws.accessKeyId`), for testing purposes (one way to inject it into the pods is sketched after this list)?
   2. Does a normal (non-Iceberg) Spark job run properly?
   
   Also, can you please provide the reproduction steps so that I can replicate this issue on my end?
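   
   For reference, a hedged sketch of one way such test-only credentials could be injected into the driver and executor pods through Spark's Kubernetes configuration (the property names are standard Spark settings; the values are placeholders, and this is for debugging only, not production):
   
   ```
   --conf spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=<test-access-key-id>
   --conf spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=<test-secret-access-key>
   --conf spark.executorEnv.AWS_ACCESS_KEY_ID=<test-access-key-id>
   --conf spark.executorEnv.AWS_SECRET_ACCESS_KEY=<test-secret-access-key>
   ```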
   
   




[GitHub] [iceberg] kbendick commented on issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

kbendick commented on issue #4804:
URL: https://github.com/apache/iceberg/issues/4804#issuecomment-1136459562

   I don't know if this has anything to do with it, but I noticed the jars you're using include both `iceberg-spark3-runtime` and `iceberg-spark3-extensions`.
   
   First: you don't need the `extensions` jar, only the `runtime` jar.
   Second: `iceberg-spark3-runtime` is only for Spark 3.0.x. For Spark 3.1 or 3.2, you should be using `iceberg-spark-runtime-3.1_2.12:0.13.1` or `iceberg-spark-runtime-3.2_2.12` (assuming your Scala version is 2.12). If you are using Spark 3.0.x, then that's the correct runtime jar.
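   
   For illustration, with emr-6.5.0 (which ships Spark 3.1.x) the `--jars` entry would point at the Spark 3.1 runtime artifact instead; the URL below is assumed to follow the standard Maven Central layout and should be verified before use:
   
   ```
   https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/0.13.1/iceberg-spark-runtime-3.1_2.12-0.13.1.jar
   ```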




[GitHub] [iceberg] akshayar closed issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

akshayar closed issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS
URL: https://github.com/apache/iceberg/issues/4804




[GitHub] [iceberg] akshayar commented on issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

akshayar commented on issue #4804:
URL: https://github.com/apache/iceberg/issues/4804#issuecomment-1135432202

   Yes, the issue is with credentials. However, what puzzles me is that it works fine on EMR on EKS when the job runs on EC2 nodes, while it fails when I run on Fargate nodes.
   What should I do for the Fargate run?
   I am not using `--conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager` because I am not doing concurrent writes.




[GitHub] [iceberg] akshayar commented on issue #4804: Iceberg Streaming Job Can't run on Fargate on EMR on EKS

akshayar commented on issue #4804:
URL: https://github.com/apache/iceberg/issues/4804#issuecomment-1138645962

   > 1. aws.accessKeyId
   
   1. When I pass `aws.accessKeyId` and `aws.secretAccessKey`, it works (one way of passing them is sketched below).
   2. A non-Iceberg program works. For example, a Hudi streaming job works without passing the access key and secret access key.
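   
   As a point of comparison, a hedged sketch of how those test-only system properties might be passed to the driver and executor JVMs (the property names are standard Spark settings; the key values are placeholders, for debugging only):
   
   ```
   --conf "spark.driver.extraJavaOptions=-Daws.accessKeyId=<test-access-key-id> -Daws.secretAccessKey=<test-secret-access-key>"
   --conf "spark.executor.extraJavaOptions=-Daws.accessKeyId=<test-access-key-id> -Daws.secretAccessKey=<test-secret-access-key>"
   ```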

