Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/04 22:13:55 UTC

[GitHub] [hudi] Rap70r edited a comment on issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

Rap70r edited a comment on issue #3697:
URL: https://github.com/apache/hudi/issues/3697#issuecomment-933893616


   Hi @xushiyan,
   
   Here is an update for our latest tests. I have switched to d3.xlarge instance type and used the following configs:
   ```
   spark-submit --deploy-mode cluster \
     --conf spark.scheduler.mode=FAIR \
     --conf spark.shuffle.service.enabled=true \
     --conf spark.sql.hive.convertMetastoreParquet=false \
     --conf spark.driver.maxResultSize=6g \
     --conf spark.driver.memory=17g \
     --conf spark.executor.cores=2 \
     --conf spark.hadoop.parquet.enable.summary-metadata=false \
     --conf spark.driver.memoryOverhead=6g \
     --conf spark.network.timeout=600s \
     --conf spark.executor.instances=50 \
     --conf spark.executor.memoryOverhead=4g \
     --conf spark.driver.cores=2 \
     --conf spark.executor.memory=8g \
     --conf spark.memory.storageFraction=0.1 \
     --conf spark.executor.heartbeatInterval=120s \
     --conf spark.memory.fraction=0.4 \
     --conf spark.rdd.compress=true \
     --conf spark.kryoserializer.buffer.max=200m \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.shuffle.partitions=200 \
     --conf spark.default.parallelism=200 \
     --conf spark.task.cpus=2
   ```
   
   I also removed `spark.sql.parquet.mergeSchema`.
   
   I have noticed a significant increase in speed for all the steps except the one that extracts events from Kafka; that step I can't seem to improve. We are using an st1 high-throughput EBS volume attached to the EMR master node. The topic is compacted and contains ~50 million records across 50 partitions. Even with the powerful instance type above, it takes 40 minutes to extract all the records.
   Basically, the slow part is the seeking: it takes a couple of minutes to seek from offset 50000 to 100000.
   Do you have any suggestions on how to improve data ingestion from Kafka using Spark Structured Streaming?
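   For scale, those numbers work out to a fairly low per-partition read rate (a back-of-the-envelope sketch, assuming records are spread evenly across the 50 partitions):
   
   ```python
   # Rough throughput implied by the figures above: ~50M records in 40 minutes.
   records = 50_000_000
   minutes = 40
   partitions = 50
   
   overall_rate = records / (minutes * 60)       # records/sec across the whole topic
   per_partition_rate = overall_rate / partitions  # records/sec per Kafka partition
   
   print(f"overall: ~{overall_rate:,.0f} rec/s, per partition: ~{per_partition_rate:,.0f} rec/s")
   ```
   
   Roughly 21K records/sec overall, or only ~400 records/sec per partition, which is far below what a Kafka consumer can normally sustain, so the bottleneck is likely the read parallelism or the per-offset seeking rather than raw broker throughput.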
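   For reference, one lever we have been looking at (an assumption on our side, option names taken from the Spark Structured Streaming Kafka integration guide) is setting `minPartitions` to decouple the number of Spark read tasks from the 50 Kafka partitions, and `maxOffsetsPerTrigger` to cap how much each micro-batch pulls. A minimal sketch, where the broker address and topic name are placeholders:
   
   ```python
   # Hypothetical Kafka source options; values here are illustrative only.
   kafka_options = {
       "kafka.bootstrap.servers": "broker1:9092",  # placeholder broker
       "subscribe": "events-topic",                # placeholder topic name
       "startingOffsets": "earliest",
       # Cap records per micro-batch so a single batch doesn't read all ~50M records:
       "maxOffsetsPerTrigger": "5000000",
       # Ask Spark to split the 50 Kafka partitions into more read tasks:
       "minPartitions": "200",
   }
   
   # Usage (requires an active SparkSession `spark`):
   # df = spark.readStream.format("kafka").options(**kafka_options).load()
   ```
   
   Would tuning these be the right direction here, or is the offset seeking cost independent of read parallelism?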
   
   Thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org