Posted to commits@hudi.apache.org by "haripriyarhp (via GitHub)" <gi...@apache.org> on 2023/03/17 13:11:25 UTC

[GitHub] [hudi] haripriyarhp opened a new issue, #8217: [SUPPORT] Async compaction & ingestion performance

haripriyarhp opened a new issue, #8217:
URL: https://github.com/apache/hudi/issues/8217

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I have a Spark structured streaming job writing to a MoR table in S3. Based on the docs here https://hudi.apache.org/docs/next/compaction#spark-structured-streaming , I have set the properties for async compaction. As per my understanding, compaction runs asynchronously so that ingestion can also take place in parallel. But what I observed is that Spark assigns all the resources to the compaction job, and only when it is finished does it continue with the ingestion, even though both jobs are running in parallel. Is there something like the HoodieDeltaStreamer options `--delta-sync-scheduling-weight`, `--compact-scheduling-weight`, `--delta-sync-scheduling-minshare`, and `--compact-scheduling-minshare` for Spark structured streaming, so that ingestion and compaction can run in parallel with proper resource allocation?
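   
   For reference, a minimal sketch of the writer setup (hedged: `df`, `checkpointDir`, and `tablePath` are placeholders, not the exact job; the option keys are the ones from the linked compaction docs and from the job configs further down):
   
   ```scala
   // Minimal sketch of a structured streaming write into a MoR table with async compaction enabled.
   // df is the incoming streaming DataFrame; checkpointDir and tablePath are hypothetical paths.
   df.writeStream
     .format("hudi")
     .option("hoodie.table.name", tableName)
     .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.compaction.async.enable", "true")  // async compaction on
     .option("hoodie.compact.inline.max.delta.commits", "10")      // plan compaction every 10 delta commits
     .option("checkpointLocation", checkpointDir)
     .outputMode("append")
     .start(tablePath)
   ```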
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   I expect compaction and ingestion to happen in parallel.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.1.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : Running on the Spark operator on k8s
   
   Here is a screenshot of the micro batches for the Spark structured streaming job. Normally each batch takes 4-6 mins, but the batches immediately following compaction take considerably longer.
   ![async_compaction](https://user-images.githubusercontent.com/109664817/225903090-f2a8aaaf-aa34-49d6-aa59-c7127849832c.PNG)
   
   Batch 206 (time taken for a normal batch):
   ![sensor_morbloom_124_details](https://user-images.githubusercontent.com/109664817/225904984-b13864e8-ddb3-4bd9-8f14-7330d05956f6.PNG)
   
   Batch 207:
   Here you can see the last job, 2961, is HoodieCompactionPlanGenerator, which adds 8.6 mins of overhead to this batch.
   ![sensor_morbloom_125_details](https://user-images.githubusercontent.com/109664817/225903294-18ee13ad-bafc-4f57-a820-cf7f9eb6237a.PNG)
   
   Batch 208:
   Here you can see it starts with job 2963, but jobs 2964-2972 are for compaction, and ingestion continues with job 2973 only after the compaction jobs are finished, even though compaction runs on a different thread (see the pic below). The time taken for stages like Load base files, Building workload profile, and Getting small files has drastically increased; refer to batch 206 for the normal timings.
   ![sensor_morbloom_126_details](https://user-images.githubusercontent.com/109664817/225903283-db13427d-3338-44e0-93a2-e614451b6dd7.PNG)
   
   Compaction jobs (2964-2972):
   ![sensor_morbloom_async_compaction](https://user-images.githubusercontent.com/109664817/225909669-6a356796-4a5e-459c-a001-e20f30caeac2.PNG)
   
   Is there something that can be done to improve this performance? 
   
   **Additional context**
   
   Using hudi-spark3.1-bundle_2.12-0.13.0.jar along with hadoop-aws_3.1.2.jar and aws-java-sdk-bundle_1.11.271.jar.
   
   Job configs:

     "hoodie.table.name" -> tableName,
     "path" -> "s3a://path/Hudi/".concat(tableName),
     "hoodie.datasource.write.table.name" -> tableName,
     "hoodie.datasource.write.table.type" -> MERGE_ON_READ,
     "hoodie.datasource.write.operation" -> "upsert",
     "hoodie.datasource.write.recordkey.field" -> "col5,col6,col7",
     "hoodie.datasource.write.partitionpath.field" -> "col1,col2,col3,col4",
     "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
     "hoodie.datasource.write.hive_style_partitioning" -> "true",
     // Cleaning options
     "hoodie.clean.automatic" -> "true",
     "hoodie.clean.max.commits" -> "3",
     //"hoodie.clean.async" -> "true",
     // Hive sync options
     "hoodie.datasource.hive_sync.partition_fields" -> "col1,col2,col3,col4",
     "hoodie.datasource.hive_sync.database" -> dbName,
     "hoodie.datasource.hive_sync.table" -> tableName,
     "hoodie.datasource.hive_sync.enable" -> "true",
     "hoodie.datasource.hive_sync.mode" -> "hms",
     "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
     "hoodie.upsert.shuffle.parallelism" -> "200",
     "hoodie.insert.shuffle.parallelism" -> "200",
     "hoodie.datasource.compaction.async.enable" -> "true",
     "hoodie.compact.inline.max.delta.commits" -> "10",
     "hoodie.index.type" -> "BLOOM"
      
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] haripriyarhp commented on issue #8217: [SUPPORT] Async compaction & ingestion performance

Posted by "haripriyarhp (via GitHub)" <gi...@apache.org>.
haripriyarhp commented on issue #8217:
URL: https://github.com/apache/hudi/issues/8217#issuecomment-1482570661

   Update: As per the suggestion given in the Slack forum https://apache-hudi.slack.com/archives/C4D716NPQ/p1679068718111579?thread_ts=1678987926.576359&cid=C4D716NPQ (https://medium.com/@simpsons/efficient-resource-allocation-for-async-table-services-in-hudi-124375d58dc), I tried setting the spark.scheduler.allocation.file. I still see little improvement in performance, even though the jobs go to separate pools.
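   
   For reference, a minimal sketch of such an allocation file (the pool names match the ones visible in the Spark UI screenshots below; the weight and minShare values are illustrative, not the exact ones from this run):
   
   ```xml
   <?xml version="1.0"?>
   <!-- fairscheduler.xml: give the ingestion pool more weight than compaction -->
   <allocations>
     <pool name="sparkdatasourcewriter">
       <schedulingMode>FAIR</schedulingMode>
       <weight>3</weight>
       <minShare>2</minShare>
     </pool>
     <pool name="hoodiecompact">
       <schedulingMode>FAIR</schedulingMode>
       <weight>1</weight>
       <minShare>1</minShare>
     </pool>
   </allocations>
   ```
   
   The file is passed to Spark via spark.scheduler.allocation.file, together with spark.scheduler.mode=FAIR.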
   The job below, which looks for files to compact, goes to the sparkdatasourcewriter pool and takes ~10 mins, which increases the duration of the streaming query.
   ![Job1494](https://user-images.githubusercontent.com/109664817/227495049-d9b787a8-973f-4256-939f-a82a0c59d8fc.PNG)
   
   The other jobs go to the hoodiecompact pool
   ![Job1495](https://user-images.githubusercontent.com/109664817/227495382-61ec5f26-91a9-44d6-b166-e011663f8944.PNG)
   ![Job1497](https://user-images.githubusercontent.com/109664817/227495394-932126f3-eb9e-4db1-ad36-cc9a4c2f3f40.PNG)
   
   But still, if you look at the overall time taken, the queries remain the same as above.
   ![streamingqueries](https://user-images.githubusercontent.com/109664817/227495601-28b3da63-9478-4c1f-8d07-ce625afc7079.PNG)
   




[GitHub] [hudi] haripriyarhp commented on issue #8217: [SUPPORT] Async compaction & ingestion performance

Posted by "haripriyarhp (via GitHub)" <gi...@apache.org>.
haripriyarhp commented on issue #8217:
URL: https://github.com/apache/hudi/issues/8217#issuecomment-1497026240

   Hi, are there any insights on this?




Re: [I] [SUPPORT] Async compaction & ingestion performance [hudi]

Posted by "abhisheksahani91 (via GitHub)" <gi...@apache.org>.
abhisheksahani91 commented on issue #8217:
URL: https://github.com/apache/hudi/issues/8217#issuecomment-1830910909

   @haripriyarhp Did you find any workaround for this? I have just onboarded a Hudi DeltaStreamer pipeline in production, and I am also observing that no writes happen during compaction.
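   
   For the DeltaStreamer case, the scheduling options mentioned at the top of this issue can be passed on the command line in continuous mode. A rough sketch (the jar path, table paths, and weight/minshare numbers are placeholders, and the remaining source/schema arguments are omitted):
   
   ```sh
   # Requires FAIR scheduling; the weights and minshares below are illustrative only.
   spark-submit \
     --conf spark.scheduler.mode=FAIR \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     /path/to/hudi-utilities-bundle.jar \
     --table-type MERGE_ON_READ \
     --target-base-path s3a://path/Hudi/my_table \
     --target-table my_table \
     --continuous \
     --delta-sync-scheduling-weight 2 \
     --compact-scheduling-weight 1 \
     --delta-sync-scheduling-minshare 4 \
     --compact-scheduling-minshare 2
   ```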

