Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/07 19:19:02 UTC

[GitHub] [hudi] l-jhon opened a new issue, #5254: [SUPPORT] - Hudi Delta Streamer not create table in Glue when using spark-submit deploy-mode cluster

l-jhon opened a new issue, #5254:
URL: https://github.com/apache/hudi/issues/5254

   **Describe the problem you faced**
   
   We are using Hudi DeltaStreamer in our data ingestion pipeline, but we have a problem syncing Hudi with the Glue metastore that started after upgrading from 0.7.0 to 0.10.0. Another strange thing: when we submit the job with `deploy-mode cluster`, the table is not created in the Glue metastore, but with `deploy-mode client` it is created successfully. The catch with `deploy-mode client` is that although the table is created in the Glue metastore and on S3, the job keeps running forever.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Executing spark-submit with Hudi DeltaStreamer
   
   ```
   spark-submit \
   --deploy-mode cluster \
   --jars s3://bucket/jars/hudi-spark-bundle_2.12-0.10.0.jar,s3://bucket/jars/spark-avro_2.12-3.0.1.jar \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer s3://bucket/jars/hudi-utilities-bundle_2.12-0.10.0.jar \
   --op BULK_INSERT \
   --filter-dupes \
   --checkpoint 0 \
   --source-ordering-field updated_at \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --table-type COPY_ON_WRITE \
   --target-base-path s3://bucket/silver_layer/table_name/ \
   --target-table datalake_silver.table_name \
   --enable-sync \
   --hoodie-conf hoodie.datasource.write.recordkey.field=id \
   --hoodie-conf hoodie.datasource.write.precombine.field=updated_at \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=date_partition \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
   --hoodie-conf hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
   --hoodie-conf hoodie.combine.before.insert=True \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/bronze_layer/table_name/ \
   --hoodie-conf hoodie.datasource.hive_sync.enable=True \
   --hoodie-conf hoodie.datasource.hive_sync.database=datalake_silver \
   --hoodie-conf hoodie.datasource.hive_sync.table=table_name \
   --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=True \
   --hoodie-conf hoodie.datasource.hive_sync.partition_fields=date_partition \
   --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
   --hoodie-conf hoodie.datasource.hive_sync.support_timestamp=True \
   --hoodie-conf hoodie.consistency.check.enabled=True \
   --hoodie-conf hoodie.upsert.shuffle.parallelism=10 \
   --hoodie-conf hoodie.insert.shuffle.parallelism=10 \
   --hoodie-conf hoodie.bulkinsert.shuffle.parallelism=10 \
   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
   --hoodie-conf hoodie.archive.automatic=True \
   --hoodie-conf hoodie.archive.merge.enable=True \
   --hoodie-conf hoodie.cleaner.commits.retained=2 \
   --hoodie-conf hoodie.clean.automatic=True \
   --hoodie-conf hoodie.clean.async=True \
   --hoodie-conf hoodie.parquet.max.file.size=1073741824 \
   --hoodie-conf hoodie.parquet.small.file.limit=0 \
   --hoodie-conf hoodie.parquet.compression.codec=snappy \
   --hoodie-conf hoodie.copyonwrite.insert.auto.split=True \
   --hoodie-conf hoodie.clustering.async.enabled=True \
   --hoodie-conf hoodie.clustering.async.max.commits=4 \
   --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824 \
   --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=629145600 \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=date_partition \
   --hoodie-conf hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy \
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
   --hoodie-conf "hoodie.deltastreamer.transformer.sql=select id, code, cast(created_at as timestamp) as created_at,cast(updated_at as timestamp) as updated_at,date_partition from <SRC>"
   ```
   
   **Expected behavior**
   
   The job executes successfully and the table is created in the Glue metastore and on S3. Hudi 0.10.0 should be able to sync tables with Glue: creating new tables and, for existing tables, incrementally syncing new partitions.
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.1.1
   
   * Hive version : 3.1.2
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   We are using AWS EMR version 6.3.0
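
   For reference, Hive sync on EMR reaches Glue through the Hive metastore client factory. The cluster configuration wasn't shared in this thread, so the following is only a sketch of the setting that is commonly in play (EMR applies the equivalent via hive-site.xml when the Glue Data Catalog option is enabled):

   ```
   # Sketch only: not taken from this job. EMR's "Use AWS Glue Data Catalog for
   # table metadata" option corresponds to setting this Hive client factory, which
   # is what routes Hudi's hive_sync calls to Glue instead of a local metastore.
   --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
   ```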
   
   **Stacktrace**
   
   Unfortunately we don't have any errors to show, because the job ends normally without any problem. The issue is only with `deploy-mode cluster`, where Spark doesn't create the table in Glue; with `deploy-mode client` it does create the table in Glue, but the job never finishes.
   
   




[GitHub] [hudi] l-jhon closed issue #5254: [SUPPORT] - Hudi Delta Streamer not create table in Glue when using spark-submit deploy-mode cluster

Posted by GitBox <gi...@apache.org>.
l-jhon closed issue #5254: [SUPPORT] - Hudi Delta Streamer not create table in Glue when using spark-submit deploy-mode cluster
URL: https://github.com/apache/hudi/issues/5254




[GitHub] [hudi] l-jhon commented on issue #5254: [SUPPORT] - Hudi Delta Streamer not create table in Glue when using spark-submit deploy-mode cluster

Posted by GitBox <gi...@apache.org>.
l-jhon commented on issue #5254:
URL: https://github.com/apache/hudi/issues/5254#issuecomment-1098580619

   We solved this problem by using inline clustering instead of async clustering; data ingestion now works without problems.
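
   The exact configuration change wasn't posted; a minimal sketch of the swap, using the standard Hudi clustering keys and assuming the job from the issue above, would be:

   ```
   # Sketch: disable async clustering and enable inline clustering,
   # which runs as part of each ingestion commit.
   --hoodie-conf hoodie.clustering.async.enabled=false \
   --hoodie-conf hoodie.clustering.inline=true \
   --hoodie-conf hoodie.clustering.inline.max.commits=4 \
   ```

   Inline clustering runs synchronously inside the DeltaStreamer commit, so it avoids the separate async table service that appeared to interfere with the job here.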


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org