Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/14 17:30:25 UTC

[GitHub] [hudi] rohit-m-99 opened a new issue #5037: [SUPPORT] Deltastreamer continuous mode not working with high number of files in S3

rohit-m-99 opened a new issue #5037:
URL: https://github.com/apache/hudi/issues/5037


   **Describe the problem you faced**
   
   When the source folder contains many new files, the deltastreamer fails to pick up any updates even though it is running in continuous mode. No jobs are run in the delta-streamer UI. This happens even when writing to a brand new table.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Have a large number of files in S3 (for us the problem occurs at around 10k)
   2. Run the deltastreamer script below
   3. No updates are found
   
   **Expected behavior**
   
   Deltastreamer updates should happen continuously in continuous mode.
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   * Spark version : 3.0.3
   * Hadoop version : 3.2.0
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : Yes
   
   **Additional context**
   
   Spark Submit Job:
   
   ```
   spark-submit \
   --jars /opt/spark/jars/hudi-spark3-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar,/opt/spark/jars/spark-avro.jar \
   --master spark://spark-master:7077 \
   --total-executor-cores 10 \
   --driver-memory 4g \
   --executor-memory 4g \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/jars/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-table per_tick_stats \
   --table-type COPY_ON_WRITE \
   --continuous \
   --source-ordering-field STATOVYGIYLUMVSF6YLU \
   --target-base-path s3a://simian-kodiak-prod-output/stats/querying \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://simian-kodiak-prod-output/stats/ingesting \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ \
   --hoodie-conf hoodie.datasource.write.precombine.field=STATOVYGIYLUMVSF6YLU \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____ \
   --hoodie-conf hoodie.clustering.inline=true \
   --hoodie-conf hoodie.clustering.inline.max.commits=4 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field= \
   --source-limit 2147483648
   ```
   
   **Stacktrace**
   
   No log output or errors. Last logs below:
   
   ```
   22/03/11 03:55:32 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:32 INFO HoodieTableConfig: Loading table properties from s3a://simian-customer-prod-output/stats/querying/.hoodie/hoodie.properties
   22/03/11 03:55:32 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:33 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220310060713509__rollback__COMPLETED]}
   22/03/11 03:55:33 INFO DFSPathSelector: Using path selector org.apache.hudi.utilities.sources.helpers.DFSPathSelector
   22/03/11 03:55:33 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:33 INFO HoodieTableConfig: Loading table properties from s3a://simian-customer-prod-output/stats/querying/.hoodie/hoodie.properties
   22/03/11 03:55:33 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:33 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220310060713509__rollback__COMPLETED]}
   22/03/11 03:55:33 INFO DeltaSync: Checkpoint to resume from : Option{val=1646891776000}
   22/03/11 03:55:33 INFO DFSPathSelector: Root path => s3a://simian-customer-prod-output/stats/ingesting source limit => 2147483648
   ```
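
   A note on the last two log lines: the checkpoint printed by `DeltaSync` looks like an epoch timestamp in milliseconds. That is an assumption based on its magnitude and on my understanding that `DFSPathSelector` checkpoints on file modification time and lists the root path for newer files. A minimal sketch to decode it, assuming GNU `date`:

   ```shell
   # Value copied from the "Checkpoint to resume from" log line.
   # Assumption: it is epoch milliseconds (a file modification time).
   checkpoint_ms=1646891776000
   checkpoint_s=$((checkpoint_ms / 1000))

   # GNU date syntax; on BSD/macOS the equivalent is: date -u -r "$checkpoint_s" ...
   checkpoint_utc=$(date -u -d "@${checkpoint_s}" '+%Y-%m-%dT%H:%M:%SZ')
   echo "$checkpoint_utc"
   ```

   This prints 2022-03-10T05:56:16Z. The log ends right where the selector is about to list the root path, so a slow listing would be consistent with the silence, though nothing in the log confirms it.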


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5037: [SUPPORT] Deltastreamer continuous mode not working with high number of files in S3

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5037:
URL: https://github.com/apache/hudi/issues/5037#issuecomment-1067418764


   Can you set source-limit to Long.MAX_VALUE (9223372036854775807) and see what happens?
   Also, can you set "min-sync-interval-seconds" to something like 1 minute and let us know what you see?
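
   A rough sketch of how those two settings could be passed on the command line. The variable names are hypothetical, the value 60 is just one reading of "1 min", and the rest of the spark-submit command from the issue is elided:

   ```shell
   # Hypothetical helper variables; only the flag names and the Long value
   # come from the suggestion above.
   source_limit=9223372036854775807   # Long.MAX_VALUE
   min_sync_interval_seconds=60       # assumed value for "something like 1 min"

   extra_args="--source-limit ${source_limit} --min-sync-interval-seconds ${min_sync_interval_seconds}"

   # Printed rather than executed; append these to the existing HoodieDeltaStreamer arguments.
   echo "spark-submit ... --continuous ${extra_args}"
   ```

   The second flag sets a minimum interval between sync rounds in continuous mode, so each round starts no sooner than that many seconds after the previous one.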
   





[GitHub] [hudi] rohit-m-99 commented on issue #5037: [SUPPORT] Deltastreamer continuous mode not working with high number of files in S3

Posted by GitBox <gi...@apache.org>.
rohit-m-99 commented on issue #5037:
URL: https://github.com/apache/hudi/issues/5037#issuecomment-1067424393


   I have also tried with the source limit set to Long.MAX_VALUE. I can adjust min-sync-interval-seconds as well, but oddly I am unable to reproduce the issue now: when I rerun the spark-submit job everything seems to be fine. I will update this issue if it pops up again.





[GitHub] [hudi] rohit-m-99 commented on issue #5037: [SUPPORT] Deltastreamer continuous mode not working with high number of files in S3

Posted by GitBox <gi...@apache.org>.
rohit-m-99 commented on issue #5037:
URL: https://github.com/apache/hudi/issues/5037#issuecomment-1071056474


   Looks like this issue was solved with this fix: https://hudi.apache.org/docs/next/s3_hoodie#aws-s3-versioned-bucket
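
   For context, as I understand the linked section, it recommends a lifecycle rule that cleans up delete markers on versioned buckets, because a large number of them can slow S3 list operations. A minimal sketch of such a rule follows. The bucket name and rule ID are hypothetical, and the one-day window is an assumption rather than something stated in this thread:

   ```shell
   bucket="my-hudi-bucket"   # hypothetical bucket name

   # Expire noncurrent versions after 1 day and remove delete markers
   # once no noncurrent versions remain behind them.
   lifecycle_json='{
     "Rules": [
       {
         "ID": "expire-delete-markers",
         "Status": "Enabled",
         "Filter": {"Prefix": ""},
         "Expiration": {"ExpiredObjectDeleteMarker": true},
         "NoncurrentVersionExpiration": {"NoncurrentDays": 1}
       }
     ]
   }'

   # Printed rather than executed, so nothing is changed by accident.
   echo aws s3api put-bucket-lifecycle-configuration \
     --bucket "$bucket" \
     --lifecycle-configuration "$lifecycle_json"
   ```

   Review the rule against your own retention requirements before applying it, since it permanently deletes noncurrent object versions.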








[GitHub] [hudi] rohit-m-99 closed issue #5037: [SUPPORT] Deltastreamer continuous mode not working with high number of files in S3

Posted by GitBox <gi...@apache.org>.
rohit-m-99 closed issue #5037:
URL: https://github.com/apache/hudi/issues/5037


   





[GitHub] [hudi] nsivabalan commented on issue #5037: [SUPPORT] Deltastreamer continuous mode not working with high number of files in S3

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5037:
URL: https://github.com/apache/hudi/issues/5037#issuecomment-1067437524


   hmmm. ;) do keep us posted then.




