Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/11 09:33:23 UTC

[GitHub] [hudi] danny0405 opened a new issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness

danny0405 opened a new issue #5020:
URL: https://github.com/apache/hudi/issues/5020


   Currently we have several cleaning strategies, such as `num_commits`, `delta hours`, and `num_versions`.
   Let's say a user picks the `num_commits` strategy
   
   with the following parameters:
   
   - max 10 commits to archive
   - min 4 commits to keep alive
   - 6 commits to retain when cleaning
   
   c1 ---- c2 ---- c3 ---- c4 ---- c5 ---- c6 ---- c7 ---- c8 ---- c9 ---- c10
   
   At c10, a reader starts reading the latest file system view, which includes a file slice that was written in c1:
   
   /+
     --- fg1_c1.parquet
   
   The cleaner also starts running at c10. It finds that the number of commits exceeds 6 (10 > 6), so all the files that were committed in c1 ~ c4 are deleted, and the reader throws `FileNotFoundException`.
   
   This problem is common and occurs frequently, especially in streaming read mode. It also happens when a batch read job is complex and runs for a long time.
   
   We need some mechanism to ensure the semantic integrity of the read view.
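
The race described above can be reproduced with a small simulation. This is only a sketch of the simplified scenario in this issue, not Hudi's actual cleaner logic: the function names, the one-file-per-commit layout, and every file name other than `fg1_c1.parquet` are assumptions made for illustration.

```python
def commits_to_clean(commits, commits_retained):
    """Return the commits (oldest first) that fall outside the newest `commits_retained`."""
    cutoff = max(0, len(commits) - commits_retained)
    return commits[:cutoff]


def run_scenario():
    commits = [f"c{i}" for i in range(1, 11)]  # c1 .. c10, oldest first
    # Hypothetical layout: each commit wrote exactly one file.
    storage = {f"fg{i}_c{i}.parquet" for i in range(1, 11)}

    # The reader plans its scan at c10 and captures a file written in c1.
    reader_plan = ["fg1_c1.parquet"]

    # The cleaner runs concurrently and retains only the newest 6 commits.
    cleaned = commits_to_clean(commits, commits_retained=6)
    for commit in cleaned:
        i = commit[1:]
        storage.discard(f"fg{i}_c{i}.parquet")

    # The reader now tries to open the files it planned to read.
    missing = [f for f in reader_plan if f not in storage]
    return cleaned, missing


cleaned, missing = run_scenario()
print("cleaned commits:", cleaned)
print("files the reader can no longer find:", missing)
```

The reader's plan was valid when it was built, but nothing ties the lifetime of the planned files to the lifetime of the read, which is the gap this issue is about.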
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5020:
URL: https://github.com/apache/hudi/issues/5020#issuecomment-1073047031


   @danny0405: any follow-ups on this?





[GitHub] [hudi] scxwhite commented on issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness

Posted by GitBox <gi...@apache.org>.
scxwhite commented on issue #5020:
URL: https://github.com/apache/hudi/issues/5020#issuecomment-1065895701


   If we want to guarantee that data is not cleaned while a user is still reading it, we may need to add a third-party component, for example ZooKeeper ephemeral nodes. Otherwise, we won't know when a read ends or when a reader crashes with an exception.
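
To illustrate why something like an ephemeral node is suggested, here is a minimal in-memory sketch of a read lease that expires when a reader stops sending heartbeats. It does not use ZooKeeper; the `ReadLeaseRegistry` class, its methods, and the TTL value are all hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
class ReadLeaseRegistry:
    """Hypothetical registry; each lease mimics an ephemeral node with a session timeout."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._last_heartbeat = {}  # reader id -> time of the last heartbeat

    def heartbeat(self, reader_id, now):
        """Register a reader, or renew its lease."""
        self._last_heartbeat[reader_id] = now

    def release(self, reader_id):
        """Called by a reader that finishes normally."""
        self._last_heartbeat.pop(reader_id, None)

    def active_readers(self, now):
        """Drop leases that have not been renewed within the TTL, then list the rest."""
        expired = [r for r, t in self._last_heartbeat.items() if now - t > self.ttl]
        for reader_id in expired:
            del self._last_heartbeat[reader_id]
        return sorted(self._last_heartbeat)


registry = ReadLeaseRegistry(ttl_seconds=30)
registry.heartbeat("reader-a", now=0)
registry.heartbeat("reader-b", now=0)
registry.heartbeat("reader-a", now=25)   # reader-a is still alive
print(registry.active_readers(now=40))   # reader-b stopped renewing and has expired
```

A reader that ends normally calls `release`, while a crashed reader simply stops renewing and ages out, so the cleaner is never blocked forever by a dead reader.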





[GitHub] [hudi] nsivabalan commented on issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5020:
URL: https://github.com/apache/hudi/issues/5020#issuecomment-1067572987









[GitHub] [hudi] nsivabalan edited a comment on issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #5020:
URL: https://github.com/apache/hudi/issues/5020#issuecomment-1073047088


   Is this related to https://issues.apache.org/jira/browse/HUDI-3657?





[GitHub] [hudi] danny0405 commented on issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #5020:
URL: https://github.com/apache/hudi/issues/5020#issuecomment-1066289540


   > If we want to guarantee that data is not cleaned while a user is still reading it, we may need to add a third-party component, for example ZooKeeper ephemeral nodes. Otherwise, we won't know when a read ends or when a reader crashes with an exception.
   
   We may need some contract between the reader and the writer, something like a read lock: while a snapshot is being read, the writer cannot clean it.





[GitHub] [hudi] nsivabalan commented on issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5020:
URL: https://github.com/apache/hudi/issues/5020#issuecomment-1073047088


   Is this related to https://issues.apache.org/jira/browse/HUDI-2751?

