Posted to commits@hudi.apache.org by "nsivabalan (via GitHub)" <gi...@apache.org> on 2023/04/28 23:29:17 UTC

[GitHub] [hudi] nsivabalan commented on issue #8584: [SUPPORT] Spark SQL query FileNotFoundException using cleaner policy KEEP_LATEST_BY_HOURS

nsivabalan commented on issue #8584:
URL: https://github.com/apache/hudi/issues/8584#issuecomment-1528193678

   Hey @tpcross: can you share the entire contents of ".hoodie" for us to inspect? Since it's in S3, when you copy it locally, please use rsync rather than "cp" so that the last-modified times stay intact.
   
   From what I can glean, this is what you are reporting:
   
   The file group of interest had only one file slice, dated 23 Nov 2022.
   The query started around April 21, 2023, and new commits added two new file slices.
   I assume that between these two points in time, no other commits created new file slices for the file group of interest. Can you confirm that?
   
   But within 2.5 hours, the cleaner removed the file slice created on 23 Nov, which the running query was still trying to read, and the query failed.
   
   I went through the code. 
   From what I see, this is what the code is supposed to do, though I still need to test it out and reproduce the issue to confirm.
   
   Whenever clean planning kicks in, we deduce the earliest commit to retain based on the number of hours configured. For example, if you have configured 12 hours, we walk through the timeline and choose the commit just before the 12-hour mark.
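
   A minimal sketch of that step, to make the description concrete. It is not Hudi's code: the function name `earliest_commit_to_retain` and its parameters are hypothetical, and it assumes "just before" means the latest commit strictly older than `now - hours`. Hudi's actual implementation may differ in the details.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def earliest_commit_to_retain(
    commit_times: Sequence[datetime],
    now: datetime,
    hours_retained: int,
) -> Optional[datetime]:
    """Return the latest commit strictly before (now - hours_retained).

    Hypothetical helper that follows the description above; not Hudi's API.
    Returns None when no commit is older than the cutoff.
    """
    cutoff = now - timedelta(hours=hours_retained)
    # Commits older than the cutoff are candidates for the boundary.
    older = [t for t in commit_times if t < cutoff]
    return max(older) if older else None
```

   With 12 hours configured and the planner running at 22:00, the cutoff is 10:00, so commits at 00:00, 06:00 and 20:00 yield the 06:00 commit as the boundary.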
   
   Then, for each file group of interest:
       Among all file slices, we choose the latest file slice just before the earliest commit to retain. So, in the above example, it should have chosen the file slice from 23 Nov. (Again, I assume that after 23 Nov, up until April 21, no other file slices were created.)
       Once obtained, we skip that file slice (the latest one just before the earliest commit to retain) and remove all earlier file slices.
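
   The per-file-group rule can be sketched as follows. This is an illustration under stated assumptions, not Hudi's implementation: `plan_clean_for_file_group` is a hypothetical name, file slices are represented only by their instant-time strings, "just before" is taken as strictly earlier than the boundary, and the timestamps in the usage note are made up to mirror the scenario in this thread.

```python
from typing import List, Tuple


def plan_clean_for_file_group(
    slice_times: List[str],
    earliest_commit_to_retain: str,
) -> Tuple[List[str], List[str]]:
    """Split one file group's slices into (retained, to_delete).

    Hypothetical helper; slice_times are instant times that sort
    chronologically as strings (e.g. "20221123000000").
    """
    ordered = sorted(slice_times)
    before = [t for t in ordered if t < earliest_commit_to_retain]
    at_or_after = [t for t in ordered if t >= earliest_commit_to_retain]
    if not before:
        # Nothing older than the boundary, so nothing to clean.
        return at_or_after, []
    # Keep the latest slice before the boundary, since a query pinned to
    # the boundary commit may still read it; drop everything older.
    retained = [before[-1]] + at_or_after
    to_delete = before[:-1]
    return retained, to_delete
```

   For a file group with slices at "20221123000000", "20230421100000" and "20230421110000" and a boundary of "20230421090000", this rule retains all three slices and deletes none, which is why the 23 Nov slice would be expected to survive.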
   
   So, I don't see any issue here. 
   
   If you can confirm the details asked for, that would be helpful.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org