Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/01 07:00:52 UTC

[GitHub] [hudi] rohit-m-99 commented on issue #6015: [SUPPORT] Building workload profile failing after upgrade to 0.11.0

rohit-m-99 commented on issue #6015:
URL: https://github.com/apache/hudi/issues/6015#issuecomment-1172007450

   Was able to successfully run the job by:
   
   1. Downgrading from Spark 3.2.1 to 3.1.2
   2. Using hadoop version 3.2.0
   3. Using hudi-utilities bundle exclusively in the deltastreamer
   4. Exclusively using the insert operation
   
   ```
   #!/bin/bash
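   # Positional arguments (as used by the --hoodie-conf settings below):
   #   $1 = source DFS root (hoodie.deltastreamer.source.dfs.root)
   #   $2 = target base path
   #   $3 = source ordering / precombine field
   #   $4 = record key field(s)
   #   $5 = clustering sort columns
   #   $6 = partition path field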
   spark-submit \
   --jars /opt/spark/jars/hudi-utilities-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar \
   --master spark://spark-master:7077 \
   --total-executor-cores 10 \
   --executor-memory 4g \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/jars/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-table per_tick_stats \
   --table-type COPY_ON_WRITE \
   --min-sync-interval-seconds 30 \
   --source-limit 250000000 \
   --continuous \
   --source-ordering-field $3 \
   --target-base-path $2 \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=$1 \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=$4 \
   --hoodie-conf hoodie.datasource.write.precombine.field=$3 \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=$5 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=$6 \
   --hoodie-conf hoodie.clustering.inline=true \
   --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=100000000 \
   --hoodie-conf hoodie.clustering.inline.max.commits=4 \
   --hoodie-conf hoodie.metadata.enable=true \
   --hoodie-conf hoodie.metadata.index.column.stats.enable=true \
   --op INSERT
   ```
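   
   For reference, a hypothetical invocation of the script above would look like the following. The script name, S3 paths, and field names here are illustrative placeholders, not values from the original job:
   
   ```
   # Hypothetical example only: script name, paths, and fields are made up.
   ./run_deltastreamer.sh \
     s3a://my-bucket/raw/per_tick_stats \
     s3a://my-bucket/hudi/per_tick_stats \
     event_ts \
     "symbol,event_ts" \
     symbol \
     date
   ```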
   
   However, while this results in successful ingestion, it is still pretty slow. See the following operation:
   
   Given the 250 MB source limit, it seems like an insert-only operation shouldn't be taking on the order of 12 minutes?
   
   <img width="1432" alt="image" src="https://user-images.githubusercontent.com/84733594/176841469-548ea774-5485-4b03-bc3a-1252519cf011.png">
   

