Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/21 18:30:12 UTC

[GitHub] [hudi] JosefinaArayaTapia opened a new issue, #5389: [SUPPORT]

JosefinaArayaTapia opened a new issue, #5389:
URL: https://github.com/apache/hudi/issues/5389

   After loading a Hudi table into the AWS Glue Data Catalog and then writing a data update via Spark, reading the table again from Spark returns the full history of each record rather than just the latest version.
   
   How can I make the read return only the most recently updated data?
   
   Steps to reproduce the behavior:
   
   1. Load the data:
   
   ```python
   hudiOptions = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'period',
       'hoodie.datasource.write.precombine.field': 'last_update_time',
       'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.database': database_name,
       'hoodie.datasource.hive_sync.table': table_name,
       'hoodie.datasource.hive_sync.partition_fields': 'period',
       'hoodie.datasource.hive_sync.support_timestamp': 'true'
   }
   ```
   
   ```python
   CLIENT.write \
       .format('org.apache.hudi') \
       .option('hoodie.datasource.write.operation', 'insert') \
       .options(**hudiOptions) \
       .mode('overwrite') \
       .save('s3a://' + bucket_name + '/' + table_name)
   ```
   
   2. Read and update one row in the Hudi table:
   
   ```python
   # imports needed for the column expression below
   from pyspark.sql.functions import when, lit
   
   client = spark.sql("select * from table_name where id=59")
   
   updateDF = client.withColumn(
       "cod_estado",
       when(client.cod_estado.isNull(), lit('1')).otherwise(lit(None))
   )
   
   updateDF.write \
       .format('org.apache.hudi') \
       .option('hoodie.datasource.write.operation', 'upsert') \
       .options(**hudiOptions) \
       .mode('append') \
       .save('s3a://' + bucket_name + '/' + table_name)
   ```
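   
   For comparison, a snapshot read straight from the S3 path (bypassing the catalog) should already return only the latest version per record key on a COPY_ON_WRITE table. A minimal sketch, reusing the variables above; on Hudi 0.8 a glob over the partition folders may be required:
   
   ```python
   # Snapshot query directly against the table path; for COPY_ON_WRITE this
   # resolves to the latest file slice per record, so no history shows up.
   snapshotDF = spark.read \
       .format('org.apache.hudi') \
       .load('s3a://' + bucket_name + '/' + table_name + '/*/*')
   snapshotDF.filter('id = 59').show()
   ```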
       
   3. Query in Athena ---> OK
   
   
   ![Hudi_1](https://user-images.githubusercontent.com/3217621/164527257-df882973-555a-4a86-bf21-08338696cbf5.png)
   
   
   4. Read Parquet ---> OK
   
   ![Hudi_2](https://user-images.githubusercontent.com/3217621/164527611-f4eb1fd4-5279-445f-81ef-0aa06aeb63aa.png)
   
   5. Query in EMR from the Glue Catalog ---> NOK
   
   ![Hudi_3](https://user-images.githubusercontent.com/3217621/164527902-0d7c4068-ffda-4264-9148-6381a018327f.png)
   
   
   
   **Expected behavior**
   
   Querying from EMR through the Glue Catalog should show only the latest version of the data.
   
   **Environment Description**
   * EMR: emr-6.4.0
   
   * Hudi version : 0.8.0-amzn-0
   
   * Spark version :3.1.2
   
   * Hive version :3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   




[GitHub] [hudi] xushiyan commented on issue #5389: [SUPPORT] - AWS EMR and Glue Catalog

xushiyan commented on issue #5389:
URL: https://github.com/apache/hudi/issues/5389#issuecomment-1130084315

   @JosefinaArayaTapia have you filed an AWS support case? This concerns the AWS build of Hudi and an AWS-specific environment, so it should be troubleshot with the AWS support team.




[GitHub] [hudi] JosefinaArayaTapia commented on issue #5389: [SUPPORT] - AWS EMR and Glue Catalog

JosefinaArayaTapia commented on issue #5389:
URL: https://github.com/apache/hudi/issues/5389#issuecomment-1130096584

   Hi @xushiyan 
   
   I raised the case with AWS Support and they sent me the following configuration, which solved my problem.
   I am also still using EMR 6.4.0.
   
   
   ```
   # New options: the change here is to use ComplexKeyGenerator instead of
   # SimpleKeyGenerator, and to use more than one column in the record key field.
   
   hudiOptions = {
   'hoodie.datasource.write.precombine.field':'last_update_time',
   'hoodie.datasource.write.recordkey.field': 'id,creation_date', 
   'hoodie.table.name': 'newhuditest0439', 
   'hoodie.datasource.hive_sync.mode':'hms', 
   'hoodie.datasource.write.hive_style_partitioning':'true', 
   'hoodie.compact.inline.max.delta.commits':1, 
   'hoodie.compact.inline.trigger.strategy':'NUM_COMMITS', 
   'hoodie.datasource.compaction.async.enable':'false', 
   'hoodie.datasource.write.table.type':'COPY_ON_WRITE', 
   'hoodie.index.type':'GLOBAL_BLOOM', 
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor', 
   'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.ComplexKeyGenerator', 
   'hoodie.bloom.index.filter.type':'DYNAMIC_V0', 
   'hoodie.bloom.index.update.partition.path': 'false', 
   'hoodie.datasource.hive_sync.table':'newhuditest0439', 
   'hoodie.datasource.hive_sync.enable':'true', 
   'hoodie.datasource.write.partitionpath.field':'creation_date', 
   'hoodie.datasource.hive_sync.partition_fields':'creation_date', 
   'hoodie.datasource.hive_sync.database':'default', 
   'hoodie.datasource.hive_sync.support_timestamp': 'true'
   } 
   
   ```
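   
   Two notes on why this works (my reading, not from the AWS response itself): with `ComplexKeyGenerator` and a composite record key, `_hoodie_record_key` is stored as `field:value` pairs, and `hoodie.datasource.write.hive_style_partitioning` writes folders like `creation_date=...`. And with `GLOBAL_BLOOM` plus `hoodie.bloom.index.update.partition.path` set to `false`, an update arriving with a different partition value is applied to the record's original partition instead of landing as a second copy. A quick way to inspect the key encoding, sketched against a placeholder path:
   
   ```python
   # Inspect how the composite key and hive-style partitioning materialize;
   # the S3 path below is a placeholder, not from the original report.
   spark.read.format('org.apache.hudi') \
       .load('s3://my-bucket/newhuditest0439') \
       .select('_hoodie_record_key', '_hoodie_partition_path') \
       .show(truncate=False)
   # _hoodie_record_key looks like:     id:59,creation_date:2022-04-20
   # _hoodie_partition_path looks like: creation_date=2022-04-20
   ```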
   




[GitHub] [hudi] Gatsby-Lee commented on issue #5389: [SUPPORT] - AWS EMR and Glue Catalog

Gatsby-Lee commented on issue #5389:
URL: https://github.com/apache/hudi/issues/5389#issuecomment-1234881102

   @JosefinaArayaTapia 
   can you share which deployment you're using?
   * EMR on EC2?
   * EMR on EKS?
   * EMR Serverless?




[GitHub] [hudi] JosefinaArayaTapia commented on issue #5389: [SUPPORT] - AWS EMR and Glue Catalog

JosefinaArayaTapia commented on issue #5389:
URL: https://github.com/apache/hudi/issues/5389#issuecomment-1109905226

   Hi @codope 
   
   Here is the table creation:
   ```
   CREATE EXTERNAL TABLE `bda.cliente`(
     `_hoodie_commit_time` string,
     `_hoodie_commit_seqno` string,
     `_hoodie_record_key` string,
     `_hoodie_partition_path` string,
     `_hoodie_file_name` string,
     `id` decimal(12,0),
     `last_update_time` timestamp,
     `cod_estado` string
   )
   PARTITIONED BY (
     `periodo` int)
   ROW FORMAT SERDE
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT
     'org.apache.hudi.hadoop.HoodieParquetInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     's3a://bck/HUDI/CLIENTE'
   TBLPROPERTIES (
     'bucketing_version'='2',
     'last_commit_time_sync'='20220425221015',
     'last_modified_by'='hive',
     'last_modified_time'='1650924207')
   ```
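   
   An aside on this DDL (my observation, not something confirmed in the thread): the table is registered with `org.apache.hudi.hadoop.HoodieParquetInputFormat`, which is what hides superseded file versions from Hive and Athena. Spark, however, converts metastore Parquet tables to its native reader by default and bypasses that input format, which would match the "full history" symptom seen from EMR. A sketch of the commonly documented EMR workaround:
   
   ```python
   # Hypothetical session setup: with convertMetastoreParquet disabled, Spark
   # reads the catalog table through HoodieParquetInputFormat instead of its
   # native Parquet reader, so only the latest commit per record is returned.
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('spark.sql.hive.convertMetastoreParquet', 'false') \
       .enableHiveSupport() \
       .getOrCreate()
   ```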
   
   -----
   
   In EMR 6.5 with Hudi 0.9.0, I hit the following problem:
   
   Configuration and the call to switch to the database:
   ```
   %%configure -f
   { 
       "conf": {
               "spark.jars":"hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar,hdfs:///apps/hudi/lib/aws-java-sdk-bundle-1.12.31.jar",
               "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
               "spark.sql.extensions":"org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
   }}
   
   
   spark.sql("use bda")
   
   ```
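   
   On the stack trace below (my reading, not an AWS-confirmed fix): a `NoSuchMethodError` involving `Lcom/amazonaws/thirdparty/jackson/core/JsonToken` is the classic signature of two AWS SDKs colliding on the classpath. The bundle jar shades Jackson under `com.amazonaws.thirdparty`, while the Glue client shipped on the cluster was built against the unshaded SDK. A plausible fix, as an assumption, is to stop pinning the bundle jar and let EMR supply its own SDK:
   
   ```
   %%configure -f
   { 
       "conf": {
               "spark.jars":"hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar",
               "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
               "spark.sql.extensions":"org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
   }}
   ```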
   ```
   An error was encountered:
   An error occurred while calling o86.sql.
   : java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;
   	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:39)
   	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:29)
   	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:118)
   	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
   	at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1734)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1454)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1369)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
   	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
   	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
   	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
   	at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:10640)
   	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10607)
   	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10596)
   	at com.amazonaws.services.glue.AWSGlueClient.executeGetDatabase(AWSGlueClient.java:4466)
   	at com.amazonaws.services.glue.AWSGlueClient.getDatabase(AWSGlueClient.java:4435)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.doesDefaultDBExist(AWSCatalogMetastoreClient.java:238)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:151)
   	at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:20)
   	at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:507)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988)
   	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251)
   	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234)
   	at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402)
   	at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335)
   	at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315)
   	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:257)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
   	at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:384)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:249)
   	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
   	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:135)
   	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:125)
   	at org.apache.spark.sql.internal.SharedState.isDatabaseExistent$1(SharedState.scala:169)
   	at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:201)
   	at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:153)
   	at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:52)
   	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:99)
   	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:99)
   	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:281)
   	at org.apache.spark.sql.connector.catalog.CatalogManager.setCurrentNamespace(CatalogManager.scala:104)
   	at org.apache.spark.sql.execution.datasources.v2.SetCatalogAndNamespaceExec.$anonfun$run$2(SetCatalogAndNamespaceExec.scala:36)
   	at org.apache.spark.sql.execution.datasources.v2.SetCatalogAndNamespaceExec.$anonfun$run$2$adapted(SetCatalogAndNamespaceExec.scala:36)
   	at scala.Option.foreach(Option.scala:407)
   	at org.apache.spark.sql.execution.datasources.v2.SetCatalogAndNamespaceExec.run(SetCatalogAndNamespaceExec.scala:36)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:46)
   	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:230)
   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3751)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3749)
   	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:230)
   	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
   	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:750)
   
   Traceback (most recent call last):
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
       return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   
   ```




[GitHub] [hudi] xushiyan closed issue #5389: [SUPPORT] - AWS EMR and Glue Catalog

xushiyan closed issue #5389: [SUPPORT] - AWS EMR and Glue Catalog
URL: https://github.com/apache/hudi/issues/5389




[GitHub] [hudi] codope commented on issue #5389: [SUPPORT] - AWS EMR and Glue Catalog

codope commented on issue #5389:
URL: https://github.com/apache/hudi/issues/5389#issuecomment-1109750585

   This needs to be reproduced. As far as I recall, we have not seen this issue with a COW table before; it always returns the latest snapshot.
   1. Can you please run `SHOW CREATE TABLE <table_name>` in Glue? This should show the create-table statement with all the properties. My hunch is that the sync to the Glue catalog did not happen correctly.
   2. If possible, can you please try the same thing on EMR 6.5, which has Hudi 0.9.0?




[GitHub] [hudi] Gatsby-Lee commented on issue #5389: [SUPPORT] - AWS EMR and Glue Catalog

Gatsby-Lee commented on issue #5389:
URL: https://github.com/apache/hudi/issues/5389#issuecomment-1235104441

   For anyone who gets here: if you have this issue, you can find what you need at this link:
   https://aws.github.io/aws-emr-containers-best-practices/metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog
   
   I tested on EMR on EKS (EMR 6.7) + Hudi 0.10.1.
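   
   The short version of that page (paraphrased, so treat the details as assumptions): point the Hive metastore client factory at the Glue implementation (the same `AWSGlueDataCatalogHiveClientFactory` that appears in the stack trace earlier in this thread) and let Hudi sync through HMS:
   
   ```python
   # Sketch of the Glue-sync setup from the linked page; database and table
   # names are placeholders. The factory class is passed as a Hadoop conf:
   #   --conf spark.hadoop.hive.metastore.client.factory.class=\
   #          com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
   hudi_glue_sync_options = {
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.mode': 'hms',  # sync via the metastore client, which routes to Glue
       'hoodie.datasource.hive_sync.database': 'default',
       'hoodie.datasource.hive_sync.table': 'my_table',
   }
   ```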

