Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/26 13:54:39 UTC

[GitHub] [hudi] zherenyu831 opened a new issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream

zherenyu831 opened a new issue #2043:
URL: https://github.com/apache/hudi/issues/2043


   **Describe the problem you faced**
   
   We are using foreachBatch on a Spark Structured Streaming query to append data to Hudi,
   but compaction never seems to happen.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
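   1. Write each micro-batch to Hudi inside foreachBatch:
   ```
   import org.apache.spark.sql.{DataFrame, SaveMode}
   import org.apache.spark.sql.streaming.StreamingQuery

   // inputDF, hudiOptions, tableName, trigger and params are defined elsewhere.
   val query: StreamingQuery = inputDF
     .writeStream
     // Each micro-batch is written with the batch DataFrameWriter (.write ... .save).
     .foreachBatch { (kafkaMessages: DataFrame, batchId: Long) =>
       kafkaMessages
         .write
         .format("hudi")
         .options(hudiOptions)
         .mode(SaveMode.Append)
         .save(s"s3a://daas-hudi-test/york_test/$tableName")
     }
     .trigger(trigger)
     .option("checkpointLocation", s"${params.baseOutputFolder}/spark-checkpoint")
     .start()
   ```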
   
   ```
   ╔═════════════════════════╤═══════════╤═══════════════════════════════╗
   ║ Compaction Instant Time │ State     │ Total FileIds to be Compacted ║
   ╠═════════════════════════╪═══════════╪═══════════════════════════════╣
   ║ 20200825133707          │ REQUESTED │ 6                             ║
   ╟─────────────────────────┼───────────┼───────────────────────────────╢
   ║ 20200825095605          │ REQUESTED │ 4                             ║
   ╟─────────────────────────┼───────────┼───────────────────────────────╢
   ```
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.4
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] bvaradar commented on issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #2043:
URL: https://github.com/apache/hudi/issues/2043#issuecomment-682100271


   @zherenyu831 : Can you model your query using pure Structured Streaming APIs and avoid foreachBatch? It looks like foreachBatch triggers the batch sink rather than the streaming sink APIs. We will have a blog post on the usage shortly, but in the meantime you can reference the PR: https://github.com/apache/hudi/pull/1996/files#diff-cb5b78d0c2deafe117b643f5de250a17R50
   
   Also, please note that we have discovered an issue related to batch writes: https://issues.apache.org/jira/browse/HUDI-1230
   I have sent an email to the dev@ and users@ mailing lists about the config change that works around it.
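   
   For illustration only (an editorial sketch, not code taken from the PR above): writing through the streaming sink could look roughly like the following. The object name `HudiStreamingSinkSketch`, the helper `startHudiStream`, and its parameters are hypothetical. The two `hoodie.datasource.*` keys are assumed to be the table-type and async-compaction settings in Hudi 0.6.0, so please check them against the configuration docs for your version.
   
   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery, Trigger}

   object HudiStreamingSinkSketch {
     // Hypothetical helper: starts a streaming write that goes through
     // DataStreamWriter.format("hudi") instead of a batch write inside foreachBatch.
     def startHudiStream(
         inputDF: DataFrame,               // streaming DataFrame, e.g. read from Kafka
         hudiOptions: Map[String, String], // record key, precombine field, table name, ...
         basePath: String,                 // Hudi table base path
         checkpointDir: String,            // Spark checkpoint location
         trigger: Trigger): StreamingQuery = {
       inputDF.writeStream
         .format("hudi")
         .options(hudiOptions)
         // Assumed keys: compaction only applies to MERGE_ON_READ tables, and
         // async compaction is switched on explicitly here for clarity.
         .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
         .option("hoodie.datasource.compaction.async.enable", "true")
         .option("checkpointLocation", checkpointDir)
         .outputMode(OutputMode.Append())
         .trigger(trigger)
         .start(basePath)
     }
   }
   ```
   
   Compared with the foreachBatch version, the only structural change is that `format`, `options`, and the target path move onto the `writeStream` builder, so Spark hands each micro-batch to the streaming sink rather than to a batch `save`.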
   
   
   


[GitHub] [hudi] zherenyu831 closed issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream

Posted by GitBox <gi...@apache.org>.

zherenyu831 closed issue #2043:
URL: https://github.com/apache/hudi/issues/2043


   


[GitHub] [hudi] zherenyu831 commented on issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream

Posted by GitBox <gi...@apache.org>.

zherenyu831 commented on issue #2043:
URL: https://github.com/apache/hudi/issues/2043#issuecomment-682305478


   @bvaradar 
   Thank you for the reply. I had also seen your blog PR earlier, and it works with the pure Structured Streaming API.
   Noted; I will try to avoid this issue when batch writing.

