Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/30 17:35:56 UTC

[GitHub] [hudi] nsivabalan commented on issue #3739: Hoodie clean is not deleting old files

nsivabalan commented on issue #3739:
URL: https://github.com/apache/hudi/issues/3739#issuecomment-931525996


   thanks. Here is what is possibly happening. If you trigger more updates, eventually you will see cleaning kicking in.
   In short, this has to do with MOR tables: the cleaner needs to see N commits, not delta commits, before it can clean things up. I will try to explain what that means, but it's going to be lengthy.
   
   Let me first try to explain data files and delta log files.
   In Hudi, base (or data) files are in Parquet format, and delta log files are Avro files with a .log extension. Base files are created by commits, and log files are created by delta commits.
   
   Each base file can have zero or more log files; the log files represent updates to the data in the corresponding base file.
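   To make this concrete, here is a rough sketch of what writing such a MOR table from spark-shell could look like. The table name, field names, path, and the inserts/updates DataFrames are made up for illustration and are not taken from this issue.
   
       import org.apache.spark.sql.SaveMode
   
       // Illustrative only: assumes `inserts` and `updates` are existing DataFrames
       // with uuid/ts/partitionpath columns, and that basePath points at the table.
       val basePath = "s3://some-bucket/hudi/example_table"
   
       // Initial insert: produces base (parquet) files.
       inserts.write.format("hudi").
         option("hoodie.table.name", "example_table").
         option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
         option("hoodie.datasource.write.recordkey.field", "uuid").
         option("hoodie.datasource.write.precombine.field", "ts").
         option("hoodie.datasource.write.partitionpath.field", "partitionpath").
         option("hoodie.datasource.write.operation", "insert").
         mode(SaveMode.Append).
         save(basePath)
   
       // Later upserts to existing records: produce .log files holding the updates.
       updates.write.format("hudi").
         option("hoodie.table.name", "example_table").
         option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
         option("hoodie.datasource.write.recordkey.field", "uuid").
         option("hoodie.datasource.write.precombine.field", "ts").
         option("hoodie.datasource.write.partitionpath.field", "partitionpath").
         option("hoodie.datasource.write.operation", "upsert").
         mode(SaveMode.Append).
         save(basePath)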
   
   
   For instance, here is a simple example.
   base_file_1_c1
   log_file_1_c1_v1
   log_file_1_c1_v2
   base_file_2_c2
   
   In the above example, three commits were made: C1, C2, and C3.
   C1:
   base_file_1_c1 was created.
   C2:
   base_file_2_c2 was created, and some updates were made to the data in base_file_1, so log_file_1_c1_v1 got created.
   C3:
   Some more updates were made to base_file_1, so log_file_1_c1_v2 got created.
   
   So, if we keep making more commits similar to C3, only new log files get added. These are delta commits and are not counted as commits from a cleaning standpoint.
   Hudi has something called compaction, which merges a base file and its corresponding log files into a new version of that base file.
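   As a rough sketch, inline compaction can be enabled through write options like the ones below; the threshold of 2 delta commits is just an example value, not a recommendation:
   
       // Illustrative compaction settings, added on top of the write options sketched above.
       val compactionOpts = Map(
         "hoodie.compact.inline" -> "true",               // compact synchronously as part of the write
         "hoodie.compact.inline.max.delta.commits" -> "2" // compact after every 2 delta commits (example value)
       )
       // e.g. updates.write.format("hudi").options(compactionOpts). ... .save(basePath)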
   
   Let's say compaction kicks in at commit time C4. The files now look like this:
   base_file_1_c1
   log_file_1_c1_v1
   log_file_1_c1_v2
   base_file_1_c4
   base_file_2_c2
   
   base_file_1_c4 is nothing but (base_file_1_c1 + log_file_1_c1_v1 + log_file_1_c1_v2)
   
   Now, let's say you have configured cleaner commits retained as 1. Then (base_file_1_c1 + log_file_1_c1_v1 + log_file_1_c1_v2) would be cleaned up: as you can see, compaction created a newer version of that base file, and hence the older version is eligible for cleaning. That is not the case for base_file_2_c2, because there is only one version of it.
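   For reference, the retention being discussed is driven by the cleaner configs; a minimal sketch (the value of 1 mirrors the example above and is not a recommendation):
   
       // Illustrative cleaner settings; KEEP_LATEST_COMMITS is the default policy.
       val cleanerOpts = Map(
         "hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
         "hoodie.cleaner.commits.retained" -> "1" // retain file versions needed by the last 1 commit (example value)
       )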
   
   So, in your case, only after 4 or 5 compactions have happened will you possibly see the cleaner kicking in.
   
   
   
   
   
   
   

