Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/30 14:39:35 UTC

[GitHub] [hudi] veenaypatil opened a new issue, #6014: [SUPPORT] High runtime for a batch in SparkWriteHelper stage

veenaypatil opened a new issue, #6014:
URL: https://github.com/apache/hudi/issues/6014

   **Describe the problem you faced**
   
   We are observing an unusually high run time for one batch: a single batch took more than 15 hours to complete, while the subsequent batches ran fine. The dataset in question is not big. A few screenshots are attached for reference; GC times are low.
   The hoodieConfigs are listed under Additional context for reference.
   
   <img width="1780" alt="Screenshot 2022-06-29 at 10 04 10 PM" src="https://user-images.githubusercontent.com/52563354/176704158-598c2a7a-f090-4481-8d5c-40df8bff9235.png">
   <img width="1780" alt="Screenshot 2022-06-29 at 10 06 53 PM" src="https://user-images.githubusercontent.com/52563354/176704200-29ef41de-d0f0-49e9-82bd-aebfae4c0b5f.png">
   <img width="1780" alt="Screenshot 2022-06-29 at 10 08 11 PM" src="https://user-images.githubusercontent.com/52563354/176704211-3662aa4c-07d1-48d3-a757-ff2921729258.png">
   
   
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   
   * Spark version : 3.0.3
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.2
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : NO
   
   
   **Additional context**
   
   Hudi Configs 
   
   ```
   hoodieConfigs:
     hoodie.datasource.write.operation: upsert
     hoodie.datasource.write.table.type: MERGE_ON_READ
     hoodie.datasource.write.partitionpath.field: ""
     hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.NonpartitionedKeyGenerator
     hoodie.metrics.on: true
     hoodie.metrics.reporter.type: CLOUDWATCH
     hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.NonPartitionedExtractor
     hoodie.parquet.max.file.size: 6110612736
     hoodie.compact.inline: true
     hoodie.clean.automatic: true
     hoodie.compact.inline.trigger.strategy: NUM_AND_TIME
     hoodie.clean.async: true
     hoodie.cleaner.policy: KEEP_LATEST_COMMITS
     hoodie.cleaner.commits.retained: 120
     hoodie.keep.min.commits: 130
     hoodie.keep.max.commits: 131
   ```
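As a quick sanity check on the values above, here is a minimal Python sketch. It assumes (as far as I know, this matches the validation Hudi applies when building the write config) that `hoodie.keep.max.commits` must be greater than `hoodie.keep.min.commits`, which in turn must be greater than `hoodie.cleaner.commits.retained`. It also converts `hoodie.parquet.max.file.size` to GiB for readability. The helper names `check_retention` and `bytes_to_gib` are made up for this illustration and are not part of Hudi.

```python
# Values copied from the hoodieConfigs block in this issue.
hoodie_configs = {
    "hoodie.cleaner.commits.retained": 120,
    "hoodie.keep.min.commits": 130,
    "hoodie.keep.max.commits": 131,
    "hoodie.parquet.max.file.size": 6110612736,
}


def check_retention(cfg):
    """Return a list of ordering problems between cleaner and archival settings.

    Assumed rule: keep.max.commits > keep.min.commits > cleaner.commits.retained.
    An empty list means the ordering looks consistent.
    """
    retained = cfg["hoodie.cleaner.commits.retained"]
    keep_min = cfg["hoodie.keep.min.commits"]
    keep_max = cfg["hoodie.keep.max.commits"]
    problems = []
    if not keep_min > retained:
        problems.append("hoodie.keep.min.commits should exceed hoodie.cleaner.commits.retained")
    if not keep_max > keep_min:
        problems.append("hoodie.keep.max.commits should exceed hoodie.keep.min.commits")
    return problems


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


problems = check_retention(hoodie_configs)
max_file_gib = bytes_to_gib(hoodie_configs["hoodie.parquet.max.file.size"])

print("retention problems:", problems)
print(f"hoodie.parquet.max.file.size is about {max_file_gib:.2f} GiB")
```

Under that assumed rule the retention settings are ordered consistently (120 < 130 < 131). The configured maximum parquet file size works out to roughly 5.7 GiB, which may be worth keeping in mind when looking at the time spent rewriting base files in a non-partitioned table.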
   
   Spark Job configs
   
   ```
   {
     "className": "com.hotstar.driver.CdcCombinedDriver",
     "proxyUser": "root",
     "driverCores": 1,
     "executorCores": 4,
     "executorMemory": "4G",
     "driverMemory": "4G",
     "queue": "cdc",
     "name": "hudiJob",
     "file": "s3a://bucket/jars/prod.jar",
     "conf": {
       "spark.eventLog.enabled": "false",
       "spark.ui.enabled": "true",
       "spark.streaming.concurrentJobs": "1",
       "spark.streaming.backpressure.enabled": "false",
       "spark.streaming.kafka.maxRatePerPartition": "500",
       "spark.yarn.am.nodeLabelExpression": "cdc",
       "spark.shuffle.service.enabled": "true",
       "spark.driver.maxResultSize": "8g",
       "spark.driver.memoryOverhead": "2048",
       "spark.executor.memoryOverhead": "2048",
       "spark.dynamicAllocation.enabled": "true",
       "spark.dynamicAllocation.minExecutors": "25",
       "spark.dynamicAllocation.maxExecutors": "50",
       "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
       "spark.jars.packages": "org.apache.spark:spark-avro_2.12:3.0.2,com.izettle:metrics-influxdb:1.2.3",
       "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
       "spark.rdd.compress": "true",
       "spark.sql.hive.convertMetastoreParquet": "false",
       "spark.yarn.maxAppAttempts": "1",
       "spark.task.cpus": "1"
     }
   }
   ```
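To put the job configuration in perspective, the following Python sketch derives the task slots and container memory implied by the settings above. It assumes the unitless `memoryOverhead` value is in MiB (Spark's default unit for that setting) and that the `4G` executor memory means 4 GiB. The function `max_records_per_batch` shows the per-batch cap implied by `spark.streaming.kafka.maxRatePerPartition`; the partition count and batch interval are not given in this issue, so its arguments are left for the reader to fill in.

```python
# Values copied from the Spark job config in this issue.
executor_cores = 4
task_cpus = 1
executor_memory_mib = 4 * 1024   # "executorMemory": "4G"
executor_overhead_mib = 2048     # "spark.executor.memoryOverhead": "2048" (MiB assumed)
min_executors = 25
max_executors = 50
max_rate_per_partition = 500     # records per second per Kafka partition

# Concurrent tasks one executor can run.
slots_per_executor = executor_cores // task_cpus

# Cluster-wide task slots at the dynamic-allocation bounds.
min_task_slots = min_executors * slots_per_executor
max_task_slots = max_executors * slots_per_executor

# Memory requested per executor container (heap plus overhead).
container_mib = executor_memory_mib + executor_overhead_mib
max_total_gib = max_executors * container_mib / 1024


def max_records_per_batch(num_partitions, batch_interval_seconds):
    """Upper bound on records read per batch under the configured rate limit.

    Both arguments are unknown for this job and must be supplied by the caller.
    """
    return max_rate_per_partition * num_partitions * batch_interval_seconds


print("task slots:", min_task_slots, "to", max_task_slots)
print("container size (MiB):", container_mib)
print("max executor memory footprint (GiB):", max_total_gib)
```

With these settings the job has between 100 and 200 task slots and requests 6 GiB per executor container. If a single long-running stage occupies only a handful of those slots, that would point to skew or a small number of large file groups rather than a lack of cluster capacity.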
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org