Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/22 12:51:10 UTC

[GitHub] [hudi] floriandaniel opened a new issue, #6188: [SUPPORT] Low performance with upserts on S3 storage

floriandaniel opened a new issue, #6188:
URL: https://github.com/apache/hudi/issues/6188

   **Problem**
   I'm testing whether Apache Hudi can perform upserts faster than our current Spark-based approach.
   Each record contains 40 fields. The partitioning key is country_iso (a string field) with 200 distinct values. The partitions are quite unbalanced (US and China have many more records than the rest).
   The problem is that I'm getting very slow performance even with small datasets (~1 GB).
   I'm updating a string field which is neither the partitioning key nor the record key.
   The ratio of updates in my upsert dataset is 100% (no inserts).
   
   This could come from the way my Parquet files are partitioned, the unbalanced partitions, a poor choice of partitioning key, etc. A quick way to check the skew is sketched below.
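   
   To quantify the imbalance, here is a minimal diagnostic sketch (my own check, not part of the benchmark; `srcPath` is a placeholder for the source dataset location):
   
   ```scala
   // Hypothetical skew check on the source data. Assumes the source is
   // readable as parquet; `country_iso` is the partitioning key above.
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.desc
   
   val spark = SparkSession.builder().appName("skew-check").getOrCreate()
   val srcPath = "s3://my-bucket/src_parquet" // placeholder path
   
   // Records per partition value; a few dominant values (e.g. US, CN)
   // mean a few Hudi partitions carry most of the upsert work.
   spark.read.parquet(srcPath)
     .groupBy("country_iso")
     .count()
     .orderBy(desc("count"))
     .show(20, truncate = false)
   ```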
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.1.2-amzn-1
   
   * Hive version :
   
   * Hadoop version : 3.2.1 (Amazon)
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   * AWS EMR : emr-6.5.0, 1 master node (r5.xlarge), 2 core nodes (r5d.2xlarge)
   
   
   **Additional context**
   
   
   **Hudi Config**
   
   ```
   hoodie.index.type = BLOOM/SIMPLE
   hoodie.bloom.index.prune.by.ranges = false
   hoodie.metadata.enable = true
   hoodie.enable.data.skipping = true
   hoodie.metadata.index.column.stats.enable = true
   hoodie.bloom.index.use.metadata = true
   ```
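   
   For completeness, a minimal sketch of how these options are passed to the Spark datasource writer (the table name, save path, and the record/precombine fields `id` and `ts` are placeholders, not my real schema):
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SaveMode}
   
   // `updatesDf` stands in for the updates DataFrame described above.
   def upsert(updatesDf: DataFrame): Unit =
     updatesDf.write.format("hudi")
       .option("hoodie.table.name", "hudi_sample") // placeholder
       .option("hoodie.datasource.write.operation", "upsert")
       .option("hoodie.datasource.write.recordkey.field", "id") // placeholder
       .option("hoodie.datasource.write.precombine.field", "ts") // placeholder
       .option("hoodie.datasource.write.partitionpath.field", "country_iso")
       .option("hoodie.index.type", "BLOOM") // or "SIMPLE"
       .option("hoodie.bloom.index.prune.by.ranges", "false")
       .option("hoodie.metadata.enable", "true")
       .option("hoodie.enable.data.skipping", "true")
       .option("hoodie.metadata.index.column.stats.enable", "true")
       .option("hoodie.bloom.index.use.metadata", "true")
       .mode(SaveMode.Append)
       .save("s3://my-bucket/hudi_sample") // placeholder path
   ```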
   
   | sample | Source parquet <br> (records, size) | Updates <br> (records, size) | Upsert S3, SIMPLE index <br> (minutes) | Upsert S3, BLOOM index <br> (minutes) |
   |:----------:|:-------------:|:------:|:-------------:|:------:|
   | 1  | 8.7 M records <br> (0.9 GB) | 0.35 M records <br> (0.05 GB) | 1.80 | 1.88 |
   | 10 | 87 M records <br> (7.9 GB) | 3.5 M records <br> (0.55 GB) | 10.5 | 21.5 |
   | 25 | 217 M records <br> (18.7 GB) | 8.7 M records <br> (1.1 GB) | 27.05 | 110.5 |
   
   For example, for sample_10 I got the following results:
   | index_type | Two most costly tasks |
   |:----------:|:-------------:|
   |SIMPLE| <ul><li>Building workload profile: SIMPLE_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 1.5 min</li><li>Doing partition and writing data: SIMPLE_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.1 min</li></ul>|
   |BLOOM| <ul><li>Building workload profile: BLOOM_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 13 min -- **IMAGE 1**</li><li>Doing partition and writing data: BLOOM_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.0 min -- **IMAGE 2**</li></ul>|
   
   The image below shows the /BN partition, which contains very small parquet files.
   ![partition_bn](https://user-images.githubusercontent.com/32508360/180441763-2b16f072-f15f-46ca-b81b-9495ce99f9e6.JPG)
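   
   If the small files themselves are a factor, I assume Hudi's standard file-sizing configs could be tuned; the keys below are standard Hudi options, but the values are only illustrative and not something I have benchmarked:
   
   ```scala
   // Hypothetical file-sizing tuning; values are illustrative only.
   val fileSizingOpts = Map(
     // files below this size are considered "small" and receive new inserts
     "hoodie.parquet.small.file.limit" -> (100 * 1024 * 1024).toString,
     // target upper bound for a parquet base file
     "hoodie.parquet.max.file.size" -> (120 * 1024 * 1024).toString
   )
   // passed to the same writer as above via .options(fileSizingOpts)
   ```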
   
   Here is the Spark trace of an upsert with the Bloom index (sample_10):
   ![trace bloom sample 10](https://user-images.githubusercontent.com/32508360/180442010-de80e309-4dd5-4a16-b73e-1c9b6e619bca.JPG)
   
   **IMAGE 1**. Building workload profile: BLOOM_hudi_sample_10 (duration: 13 min):
   ![spark 2](https://user-images.githubusercontent.com/32508360/180442419-a9b6caf7-34ea-4be0-86ca-812a1689bc19.JPG)
   
   **IMAGE 2**. Doing partition and writing data: BLOOM_hudi_sample_10 (duration: ~8 min):
   ![spark executor](https://user-images.githubusercontent.com/32508360/180442252-7b664e65-9a27-495d-a878-0e28f63c6591.JPG)
   
   




[GitHub] [hudi] alexeykudinkin commented on issue #6188: [SUPPORT] Low performance with upserts on S3 storage

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on issue #6188:
URL: https://github.com/apache/hudi/issues/6188#issuecomment-1230826740

   @nsivabalan




[GitHub] [hudi] nsivabalan commented on issue #6188: [SUPPORT] Low performance with upserts on S3 storage

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6188:
URL: https://github.com/apache/hudi/issues/6188#issuecomment-1229342473

   @alexeykudinkin: can you take a look at this?




[GitHub] [hudi] alexeykudinkin commented on issue #6188: [SUPPORT] Low performance with upserts on S3 storage

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on issue #6188:
URL: https://github.com/apache/hudi/issues/6188#issuecomment-1230832162

   Hey, @floriandaniel! Thanks for taking the time to file such a detailed description.
   
   First of all, I believe the crux of the problem likely lies in the use of the Bloom Index backed by the Metadata table: we've recently identified a performance gap there, and @yihua is currently working on addressing it (there's already a PR in progress).
   
   Second, I'd recommend the following in your evaluation (a config sketch follows the list):
   
   1. Try Hudi 0.12, which was recently released (we did a lot of performance benchmarking/optimization during the last release cycle specifically to make sure Hudi's performance is top of the line)
   2. Disable `hoodie.bloom.index.use.metadata` for now (until the above fix lands)
   3. Any particular reason you're switching off `hoodie.bloom.index.prune.by.ranges`? It's a crucial aspect of the Bloom Index: it prunes the search space considerably for update-heavy workloads by checking only the files that could contain the target records (and eliminating the ones that couldn't)
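   
   Something along these lines (a sketch, not an official recipe; it assumes the writer call from the issue description, with the bundle upgraded to 0.12.0, e.g. `org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.0` for Spark 3.1):
   
   ```scala
   // Suggested adjustments, applied on top of the original writer options.
   val tunedOpts = Map(
     "hoodie.index.type"                  -> "BLOOM",
     "hoodie.bloom.index.prune.by.ranges" -> "true",  // re-enable key-range pruning
     "hoodie.bloom.index.use.metadata"    -> "false"  // sidestep the known perf gap for now
   )
   // e.g. updatesDf.write.format("hudi").options(tunedOpts) ... .save(path)
   ```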
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org