Posted to commits@hudi.apache.org by "nbalajee (via GitHub)" <gi...@apache.org> on 2023/06/26 16:52:27 UTC

[GitHub] [hudi] nbalajee commented on pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …

nbalajee commented on PR #9035:
URL: https://github.com/apache/hudi/pull/9035#issuecomment-1607857303

   > Thanks for the contribution @nbalajee , In general I'm confused why we need two marker files for each base file, before the patch, we have in-progress marker file and write status real paths, we can diff out the corrupt/retry files by comparing the in-progress marker file handles and the paths recorded in writestatus.
   > 
   > And we also have some instant completion check in HoodieFileSystemView, to ignore the files/file blocks that are still pending, so why the reader view could read data sets that are not intended to be exposed?
   
   Thanks for the review, @dannyhchen and @nsivabalan.
   
   The following diagram summarizes the issue:
   (a) A batch of records given to an executor for writing spills over into multiple data files (it is split into multiple parts due to file size limits: f1-0_w1_c1.parquet, f1-1_w1_c1.parquet, etc.).
   (b) A Spark stage is retried, and as a result all of its tasks are retried (some tasks from previous attempts could still be running). This mainly happens with a Spark FetchFailedException.
   
   ![Screenshot 2023-06-25 at 9 15 35 PM](https://github.com/apache/hudi/assets/47542891/7121d7e6-e624-4743-ad00-004fde3e8344)
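
   For reference, the reconciliation described in the quoted review (diffing the in-progress marker paths against the paths recorded in the write statuses) can be sketched roughly as below. This is only an illustration, not Hudi code: `files_to_clean` is a hypothetical helper, and the `w2` file names assume that a retried attempt writes under a different token than `w1`.

```python
def files_to_clean(marker_paths, write_status_paths):
    """Return data files that have an in-progress marker but were not
    reported in any write status (hypothetical helper, not a Hudi API)."""
    return sorted(set(marker_paths) - set(write_status_paths))


# Attempt 1 (token w1) spilled into two files. The retried attempt
# (assumed token w2) wrote its own two files for the same records.
markers = [
    "f1-0_w1_c1.parquet",
    "f1-1_w1_c1.parquet",
    "f1-0_w2_c1.parquet",
    "f1-1_w2_c1.parquet",
]

# Only the attempt whose result was accepted reports write statuses.
write_statuses = ["f1-0_w2_c1.parquet", "f1-1_w2_c1.parquet"]

stale = files_to_clean(markers, write_statuses)
print(stale)  # ['f1-0_w1_c1.parquet', 'f1-1_w1_c1.parquet']
```

   The diff only sees markers and write statuses as they exist at the moment it is computed, which is why tasks from an earlier attempt that are still running matter in case (b).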
   
   
   

