Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/16 06:05:49 UTC

[GitHub] [hudi] rohit-m-99 opened a new issue #5050: [SUPPORT] Hudi clustering / deleting markers taking significant resources and time

rohit-m-99 opened a new issue #5050:
URL: https://github.com/apache/hudi/issues/5050


   **Describe the problem you faced**
   
   The deltastreamer requires a significant amount of resources and struggles to delete file markers during clustering. The image below shows clustering taking over 3 hours to run. It also causes many pods to be evicted because it requires more storage than is available.
   
   <img width="1435" alt="image" src="https://user-images.githubusercontent.com/84733594/158526765-c5d31bd5-367a-4e6e-b929-09c2c2297468.png">
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Have a large number of S3 files
   2. Run deltastreamer script below
   
   **Expected behavior**
   
   Deltastreamer updates should happen continuously in continuous mode.
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   * Spark version : 3.0.3
   * Hadoop version : 3.2.0
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : Yes
   
   **Additional context**
   
   Spark Submit Job:
   
   ```
   spark-submit \
   --jars /opt/spark/jars/hudi-spark3-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar,/opt/spark/jars/spark-avro.jar \
   --master spark://spark-master:7077 \
   --driver-memory 4g \
   --executor-memory 4g \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/jars/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-table per_tick_stats \
   --table-type COPY_ON_WRITE \
   --continuous \
   --source-ordering-field STATOVYGIYLUMVSF6YLU \
   --target-base-path s3a://simian-example-prod-output/stats/querying \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://simian-example-prod-output/stats/ingesting \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ \
   --hoodie-conf hoodie.datasource.write.precombine.field=STATOVYGIYLUMVSF6YLU \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____ \
   --hoodie-conf hoodie.clustering.inline=true \
   --hoodie-conf hoodie.clustering.inline.max.commits=4 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field= 
   ```
   
   **Stacktrace**
   
   No errors, just taking a lot of time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] Rohit42 commented on issue #5050: [SUPPORT] Hudi clustering / deleting markers taking significant resources and time

Posted by GitBox <gi...@apache.org>.
Rohit42 commented on issue #5050:
URL: https://github.com/apache/hudi/issues/5050#issuecomment-1069197079


   <img width="1435" alt="image" src="https://user-images.githubusercontent.com/8977448/158614276-b5037ed4-975b-4923-ace2-3c47610fa0d4.png">
   





[GitHub] [hudi] rohit-m-99 commented on issue #5050: [SUPPORT] Hudi clustering / deleting markers taking significant resources and time

Posted by GitBox <gi...@apache.org>.
rohit-m-99 commented on issue #5050:
URL: https://github.com/apache/hudi/issues/5050#issuecomment-1071004714


   Still seeing this issue. On another cluster it looks like this:
   <img width="1420" alt="image" src="https://user-images.githubusercontent.com/84733594/158839655-18e59b61-be5b-4277-bd59-4bde4ecc6270.png">
   






[GitHub] [hudi] rohit-m-99 commented on issue #5050: [SUPPORT] Hudi clustering / deleting markers taking significant resources and time

Posted by GitBox <gi...@apache.org>.
rohit-m-99 commented on issue #5050:
URL: https://github.com/apache/hudi/issues/5050#issuecomment-1072604667


   Spoke with @nsivabalan about this. It looks like the issue was not related to DeleteMarkers but rather to unioning all the data. Clustering still seems to take an excessive amount of resources given that our data is < 10GB right now, but that is discussed in more detail here: https://github.com/apache/hudi/issues/4891.
   
   The issue went away after getting rid of a rollback and reducing our small file limit: https://hudi.apache.org/docs/configurations/#hoodieclusteringplanstrategysmallfilelimit
   
   Will reopen if issue reappears. 
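   
   For reference, a minimal sketch of how the small file limit could be passed to the deltastreamer as an extra `--hoodie-conf` flag. The 100 MB figure and the variable names are assumptions for illustration only (the value we ended up with is not recorded here); the setting is given in bytes.
   
   ```shell
   # Hypothetical threshold: files smaller than this are candidates for clustering.
   # 100 MB is an illustrative value, not the one used in this issue.
   SMALL_FILE_LIMIT_MB=100
   
   # Convert megabytes to bytes, since the setting is expressed in bytes.
   SMALL_FILE_LIMIT_BYTES=$((SMALL_FILE_LIMIT_MB * 1024 * 1024))
   
   # Build the extra flag to append to the spark-submit command from the issue description.
   CLUSTERING_CONF="--hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=${SMALL_FILE_LIMIT_BYTES}"
   
   echo "${CLUSTERING_CONF}"
   ```
   
   A lower limit means fewer files qualify as "small", so each clustering plan should pick up less data to rewrite.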





[GitHub] [hudi] rohit-m-99 closed issue #5050: [SUPPORT] Hudi clustering / deleting markers taking significant resources and time

Posted by GitBox <gi...@apache.org>.
rohit-m-99 closed issue #5050:
URL: https://github.com/apache/hudi/issues/5050


   

