Posted to commits@hudi.apache.org by "tatiana-rackspace (via GitHub)" <gi...@apache.org> on 2023/03/02 17:36:28 UTC

[GitHub] [hudi] tatiana-rackspace opened a new issue, #8085: [SUPPORT] deltacommit triggering criteria

tatiana-rackspace opened a new issue, #8085:
URL: https://github.com/apache/hudi/issues/8085

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   Can you please help us understand when a delta commit is triggered for MoR tables? What are the criteria: number of records, or number of seconds?
   
   **To Reproduce**
   We are running delta streamer on EMR to ingest files from S3.
   
   deltastreamer config: 
   ```
   spark-submit \
   --jars /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --source-ordering-field ts \
   --target-base-path s3a://hudi-test-table/deltastreamer_test_npartitioned/ \
   --target-table deltastreamer_test_npartitioned \
   --enable-sync \
   --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
   --table-type MERGE_ON_READ \
   --op UPSERT \
   --continuous \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-test-s3-target/parquet/public/users_cdc_test/2023/03/01/ \
   --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
   --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
   --hoodie-conf hoodie.datasource.hive_sync.database=hudideltastreamer \
   --hoodie-conf hoodie.datasource.hive_sync.table=deltastreamer_test_npartitioned \
   --hoodie-conf hoodie.datasource.write.recordkey.field=user_id \
   --hoodie-conf hoodie.datasource.write.partitionpath.field="" \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
   --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   ```
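   For context, two DeltaStreamer options bound how often a sync (and hence a delta commit) happens in continuous mode; neither is set in the command above, so the defaults apply (no minimum interval, unbounded per-batch read). The values below are illustrative examples, not recommendations:

```shell
# Example flags that could be appended to the spark-submit command above:
# --min-sync-interval-seconds: minimum seconds between sync rounds
# --source-limit: max bytes consumed per batch for DFS sources
--min-sync-interval-seconds 300 \
--source-limit 1073741824
```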
   
   
   We are testing how it ingests 1000 small files (around 10 KB each) from S3, with 80000 rows inserted into the table.
   Test 1:
   Files are generated and placed in S3 first.
   Deltastreamer starts after all the files are there, ingests them, and produces a single delta commit.
   
   Test 2:
   The same amount of data: 1000 small files (around 10 KB each).
   Files are generated and placed in S3 while deltastreamer is running at the same time in continuous mode.
   Timeline:
   12:00:00  Deltastreamer started running in continuous mode
   12:03:48  Parquet files started arriving in S3
   12:03:48  Deltastreamer started processing them
   12:04:53  The last file arrived in S3
   12:05:20  Deltastreamer finished data ingestion
   
   During this window, 3 delta commits were generated.
   
   **Expected behavior**
   
   Can you please help us understand why there is 1 delta commit in the first test and 3 delta commits in the second test, given the same amount of input data? How is a delta commit triggered, and what are the criteria?
   
   **Environment Description**
   
* Hudi version : 0.12.1
   
   * Spark version : 3.3
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] tatiana-rackspace commented on issue #8085: [SUPPORT] deltacommit triggering criteria

Posted by "tatiana-rackspace (via GitHub)" <gi...@apache.org>.
tatiana-rackspace commented on issue #8085:
URL: https://github.com/apache/hudi/issues/8085#issuecomment-1456894569

   Thank you for your reply. By default `min-sync-interval-seconds` is set to 0 and `source-limit` is unlimited. So with the default settings it will check the source, fetch all available data, commit, and then repeat the whole process again (check, fetch, commit). Is this understanding correct?
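   That reading of the defaults can be sketched as a toy loop (illustrative Python, not Hudi's actual code; `fetch_new_data` and `commit` stand in for the source read and the Hudi write):

```python
import time

def continuous_loop(fetch_new_data, commit, min_sync_interval_seconds=0, rounds=3):
    """Toy model of continuous mode: repeatedly read everything since the
    last checkpoint, commit it, then wait out any remaining sync interval."""
    commits = 0
    for _ in range(rounds):
        start = time.monotonic()
        batch = fetch_new_data()   # everything available since the last checkpoint
        if batch:                  # an empty round produces no delta commit
            commit(batch)
            commits += 1
        remaining = min_sync_interval_seconds - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)  # with the default of 0, there is no pause
    return commits
```

   With the interval at 0, each round starts as soon as the previous commit finishes, so commit count simply tracks how many rounds found new data.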




[GitHub] [hudi] nsivabalan closed issue #8085: [SUPPORT] deltacommit triggering criteria

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed issue #8085: [SUPPORT] deltacommit triggering criteria
URL: https://github.com/apache/hudi/issues/8085




[GitHub] [hudi] nsivabalan commented on issue #8085: [SUPPORT] deltacommit triggering criteria

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8085:
URL: https://github.com/apache/hudi/issues/8085#issuecomment-1453904397

   Hi @tatiana-rackspace:
   Deltastreamer, as you might know, is a streaming ingestion tool.
   There is a source limit on how much each batch consumes: for Kafka it is the number of messages, and for DFS-based sources it is the number of bytes.
   
   You can configure the source limit using `--source-limit`. More info can be found here: https://hudi.apache.org/docs/hoodie_deltastreamer
   
   Also, it depends on how much data was available when sync() was called.
   Let's say you have configured the min sync interval to 30 minutes (`--min-sync-interval-seconds`): deltastreamer will try to fetch data from the source and sync to Hudi once every 30 minutes.
   So at t0 it will consume from the source, adhering to the max limit you have configured, and then after 30 minutes it will consume from the source again starting from the last checkpoint, again adhering to the source limit.
   
   Let me know if this clarifies things.
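   That arrival-timing dependence can be illustrated with a toy model (Python sketch, not Hudi code): each sync commits whatever arrived since the last checkpoint, so the commit count tracks arrival timing rather than total volume.

```python
def count_delta_commits(arrival_times, sync_times):
    """Count syncs that see at least one new file; each such sync
    corresponds to one delta commit in this simplified model."""
    commits = 0
    checkpoint = float("-inf")
    for sync in sync_times:
        new_files = [t for t in arrival_times if checkpoint < t <= sync]
        if new_files:
            commits += 1
            checkpoint = sync
    return commits
```

   If all files are already present before the first sync, everything lands in one commit; if syncs fire while files are still arriving, the same files split across several commits.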
   




[GitHub] [hudi] nsivabalan commented on issue #8085: [SUPPORT] deltacommit triggering criteria

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8085:
URL: https://github.com/apache/hudi/issues/8085#issuecomment-1465084173

   Yes, you are right.
   Going ahead and closing out the GitHub issue. Feel free to open a new one if you have any further questions or need more clarification.

