Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/17 19:16:56 UTC

[GitHub] [hudi] ft-bazookanu opened a new issue, #6970: [SUPPORT] Performance of Snapshot Exporter

ft-bazookanu opened a new issue, #6970:
URL: https://github.com/apache/hudi/issues/6970

   Increasing spark.executor.memory or spark.executor.cores _worsens_ performance of HUDI Exporter
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run the HUDI exporter varying spark.executor.instances, spark.executor.memory and spark.executor.cores
   ![image](https://user-images.githubusercontent.com/107943394/196262534-60be19aa-b161-4382-a920-fe0886311377.png)
   
   
   **Expected behavior**
   1. Performance should not worsen if we increase spark.executor.memory and spark.executor.cores while keeping spark.executor.instances constant.
   
   We also hoped to have better performance in general, on par with `s3 cp`. What can we do to improve Exporter's performance?
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
    - The total size of the exported data is 200GB.
    - The HUDI table has 500 partitions.
    - .hoodie/ has 4000 objects
    - The exporter is running on AWS EMR.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] ft-bazookanu commented on issue #6970: [SUPPORT] Performance of Snapshot Exporter

Posted by GitBox <gi...@apache.org>.
ft-bazookanu commented on issue #6970:
URL: https://github.com/apache/hudi/issues/6970#issuecomment-1288194206

   Please see https://hudi.apache.org/docs/snapshot_exporter/ (partitioner configs are ignored when the output format is `hudi`). Moreover, we're using this as a backup and do not want to repartition. I feel my issue is orthogonal to partitioning:
   - why does performance _decrease_ on increasing memory/cores per executor?
   - why does performance saturate at 16 executors, although the table has far more than 16 partitions? 

   Most of the time is spent exporting the contents of `.hoodie/`, which appears to be happening serially rather than in parallel.
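
One way to reason about the saturation question is to count concurrent task slots. The sketch below is only an illustration, not a diagnosis of this issue: the helper names `task_slots` and `task_waves` are hypothetical, the 4 cores per executor is an assumed value, and the assumption that a stage runs one task per table partition may not match what the exporter actually does.

```python
import math


def task_slots(executor_instances, executor_cores, task_cpus=1):
    """Concurrent task slots for a Spark application.

    Each executor runs executor_cores // task_cpus tasks at a time
    (spark.task.cpus defaults to 1).
    """
    return executor_instances * (executor_cores // task_cpus)


def task_waves(num_tasks, slots):
    """Number of scheduling waves needed to run num_tasks on the given slots."""
    return math.ceil(num_tasks / slots)


# 16 executors with an assumed 4 cores each: 64 slots.
slots = task_slots(executor_instances=16, executor_cores=4)

# If a stage had one task per partition (500, an assumption), it would need 8 waves.
print(slots, task_waves(500, slots))

# If a stage had only 16 tasks, it would already fit in one wave,
# so adding executors or cores could not shorten that stage.
print(task_waves(16, slots))
```

If a stage's task count is fixed and smaller than the number of slots, extra executors or cores leave slots idle, which would be consistent with throughput flattening out.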




[GitHub] [hudi] ft-bazookanu commented on issue #6970: [SUPPORT] Performance of Snapshot Exporter

Posted by GitBox <gi...@apache.org>.
ft-bazookanu commented on issue #6970:
URL: https://github.com/apache/hudi/issues/6970#issuecomment-1287544932

   @nsivabalan thanks for the response. `--output-format` is set to `hudi` so `--output-partitioner` is not an option. Is there anything else we can do?




[GitHub] [hudi] xushiyan commented on issue #6970: [SUPPORT] Performance of Snapshot Exporter

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #6970:
URL: https://github.com/apache/hudi/issues/6970#issuecomment-1321694740

   > Please see https://hudi.apache.org/docs/snapshot_exporter/ partitioner configs are ignored when the output format is hudi. Moreover we're using this as a backup and do not want to repartition. I feel my issue is orthogonal to partitioning:
   > 
   > * why does performance _decrease_ on increasing memory/cores per executor?
   > * why does performance saturate at 16 executors, although the table has far more than 16 partitions?
   > 
   > Most of the time is spent exporting the contents of `.hoodie/`, which appears to be happening serially (not parallel).
   
   @ft-bazookanu thanks for raising these problems. There are a few places where the exporter's performance can be improved. For example, the parallelism was not set properly for listing partitions and base files to copy. Also, the commit files under `.hoodie/` are currently copied serially in a for loop. The JIRA HUDI-712 was filed a long time ago and deprioritized. We can pick it up now, since it benefits real use cases.
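
The difference between a serial copy loop and a concurrent one can be sketched as follows. This is not Hudi's implementation (the exporter is a Spark job), and it copies on the local filesystem rather than S3; the helper name `copy_metadata_files` and the `max_workers` default are hypothetical and chosen only for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_metadata_files(src_dir, dst_dir, max_workers=16):
    """Copy every regular file under src_dir to dst_dir using a thread pool.

    Hypothetical helper: it replaces a serial `for` loop over files with
    concurrent copies, preserving the relative directory layout.
    """
    src_root = Path(src_dir)
    dst_root = Path(dst_dir)
    files = [p for p in src_root.rglob("*") if p.is_file()]

    def copy_one(src):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        return dst

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all copies and re-raises any worker exception.
        return list(pool.map(copy_one, files))
```

Threads suit this kind of work because each copy is dominated by I/O wait, so many small files (such as the roughly 4000 objects reported under `.hoodie/`) can be in flight at once instead of one after another.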




[GitHub] [hudi] nsivabalan commented on issue #6970: [SUPPORT] Performance of Snapshot Exporter

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6970:
URL: https://github.com/apache/hudi/issues/6970#issuecomment-1287879863

   Can you try setting `--output-partition-field`? That should also repartition your data based on this field and increase your parallelism.




[GitHub] [hudi] nsivabalan commented on issue #6970: [SUPPORT] Performance of Snapshot Exporter

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6970:
URL: https://github.com/apache/hudi/issues/6970#issuecomment-1286466166

   Can you try setting a partitioner with `--output-partitioner`? This should help improve performance.
   
   Also, are you setting any value for `--output-partition-field`?




[GitHub] [hudi] codope closed issue #6970: [SUPPORT] Performance of Snapshot Exporter

Posted by GitBox <gi...@apache.org>.
codope closed issue #6970: [SUPPORT] Performance of Snapshot Exporter
URL: https://github.com/apache/hudi/issues/6970

