You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/05/19 18:00:20 UTC

[GitHub] [hudi] pratyakshsharma opened a new pull request #2967: Added blog for Hudi cleaner service

pratyakshsharma opened a new pull request #2967:
URL: https://github.com/apache/hudi/pull/2967


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Added blog for Hudi's cleaner table service.
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-850883744


   @nsivabalan Please ignore run command section coming twice. That is because of the way screenshots are taken. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-855224324


   We can land once addressed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-850883551


   <img width="753" alt="cleaner-blog-1" src="https://user-images.githubusercontent.com/30863489/120081999-1145ec80-c0de-11eb-9280-a91d260a126f.png">
   <img width="755" alt="cleaner-blog-2" src="https://user-images.githubusercontent.com/30863489/120082012-1efb7200-c0de-11eb-8b1c-bc8e98a91c66.png">
   <img width="664" alt="cleaner-blog-3" src="https://user-images.githubusercontent.com/30863489/120082025-3c304080-c0de-11eb-8c6b-242e45d07286.png">
   <img width="664" alt="cleaner-blog-4" src="https://user-images.githubusercontent.com/30863489/120082032-44887b80-c0de-11eb-872f-0692720c9694.png">
   <img width="668" alt="cleaner-blog-5" src="https://user-images.githubusercontent.com/30863489/120082037-49e5c600-c0de-11eb-894d-3e7f73a2cb81.png">
   
   @nsivabalan Please take a look. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-848097336


   @n3nash Ack. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-859838218






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r644803395



##########
File path: docs/_posts/2021-05-28-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,106 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Ensuring isolation between Hudi writers and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and keeping your data lake storage costs in check
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user is ingesting data into a hudi dataset of type COPY_ON_WRITE every 30 minutes as shown below:
+
+![Initial timeline](/assets/images/blog/hoodie-cleaner/Initial_timeline.png)
+_Figure1: Incoming records getting ingested into a hudi dataset every 30 minutes_
+
+The figure shows a particular partition on DFS where commits and corresponding file versions are color coded. 4 different file groups are created in this partition as depicted by fileId1, fileId2, fileId3 and fileId4. File group corresponding to fileId2 has records ingested from all the 5 commits, while the group corresponding to fileId4 has records from the latest 2 commits only.
+
+Suppose the user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=2
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 2 (configured) + 1 commits are determined. In Figure1, `commit 10:30` and `commit 10:00` correspond to the latest 2 commits in the timeline. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 1 hour to finish, and ingestion happens every 30 minutes, you need to retain last 2 commits since 2*30 = 60 (1 hour). At this point of time, the longest query can still be using files written in 3rd commit in reverse order. Essentially this means if a query started executing after `commit 9:30`, it will still be running when clean action is triggered right after `commit 10:30` as in Figure2. 
+ -  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 3rd commit (`commit 9:30` in figure below) in reverse order.
+
+![Retain latest commits](/assets/images/blog/hoodie-cleaner/Retain_latest_commits.png)
+_Figure2: Files corresponding to latest 3 commits are retained_
+
+Now, suppose he uses the below configs for cleaning:

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-847388400


   @pratyakshsharma Can you address the open items so we can land this soon ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r640889787



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"

Review comment:
       done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r640890373



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r636320181



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.

Review comment:
       Let's remove "is a static numeric" 

##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.

Review comment:
       use-ful to useful 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r646041232



##########
File path: docs/_posts/2021-06-03-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,106 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Ensuring isolation between Hudi writers and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and keeping your data lake storage costs in check
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user is ingesting data into a hudi dataset of type COPY_ON_WRITE every 30 minutes as shown below:
+
+![Initial timeline](/assets/images/blog/hoodie-cleaner/Initial_timeline.png)
+_Figure1: Incoming records getting ingested into a hudi dataset every 30 minutes_
+
+The figure shows a particular partition on DFS where commits and corresponding file versions are color coded. 4 different file groups are created in this partition as depicted by fileId1, fileId2, fileId3 and fileId4. File group corresponding to fileId2 has records ingested from all the 5 commits, while the group corresponding to fileId4 has records from the latest 2 commits only.

Review comment:
       yeah, I do get it. But feel using file group in figure would make sense as it represents multiple file slices. 

##########
File path: docs/_posts/2021-06-03-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,106 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Ensuring isolation between Hudi writers and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and keeping your data lake storage costs in check
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user is ingesting data into a hudi dataset of type COPY_ON_WRITE every 30 minutes as shown below:
+
+![Initial timeline](/assets/images/blog/hoodie-cleaner/Initial_timeline.png)
+_Figure1: Incoming records getting ingested into a hudi dataset every 30 minutes_
+
+The figure shows a particular partition on DFS where commits and corresponding file versions are color coded. 4 different file groups are created in this partition as depicted by fileId1, fileId2, fileId3 and fileId4. File group corresponding to fileId2 has records ingested from all the 5 commits, while the group corresponding to fileId4 has records from the latest 2 commits only.

Review comment:
       yeah, I do get it. But feel using file group in figure would make sense as it represents multiple file slices together. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r636321805



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.
+
+### Configurations
+
+You can find the details about all the possible configurations along with the default values [here](https://hudi.apache.org/docs/configurations.html#compaction-configs).
+
+### Run command

Review comment:
       I think it will be useful to add a section on how to enable async cleaner in HoodieDeltaStreamer using configs




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r637005303



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.

Review comment:
       I feel it would be nice if we can give more details here. Even for me it took me a while to understand the details when I first bumped into something regarding cleaning policy.
   May be, show an example where there are 5 commits, where in each commit different file groups are touched. And show a timeline of cleaning events on every commit. If not, first time readers might wonder how is this different from previous policy. main crux here is that not every file group will be touched in every commit. 
   Also, you could considering doing the same w/ previous policy too since some file groups might have more file versions while others might have less. 

##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.
+
+### Configurations
+
+You can find the details about all the possible configurations along with the default values [here](https://hudi.apache.org/docs/configurations.html#compaction-configs).
+
+### Run command
+
+Hudi's cleaner table service can be run as a separate process or along with your data ingestion. As mentioned earlier, it basically cleans up any stale/old files lying around. In case you want to run it along with ingesting data, configs are available which enable you to run it in [parallel or in sync](https://hudi.apache.org/docs/configurations.html#withAsyncClean). You can use the below command for running the cleaner independently:

Review comment:
       I guess we can call out we have hudi-cli capability as well. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r636319318



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.

Review comment:
       Nit : "Suppose a writer ingesting data  into a Hudi dataset every 30 minutes" -> "Suppose a writer **is** ingesting data  into a Hudi dataset every 30 minutes"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r640915944



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.
+
+### Configurations
+
+You can find the details about all the possible configurations along with the default values [here](https://hudi.apache.org/docs/configurations.html#compaction-configs).
+
+### Run command

Review comment:
       Sure, will add it. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-859327106


   @pratyakshsharma I think there's 1 last comment left, let's address that and ship this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-860282110


   awesome, thanks for your contribution. This will definitely benefit the community. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-853878207


   <img width="663" alt="final-cleaner-1" src="https://user-images.githubusercontent.com/30863489/120653442-4c5d6c80-c49e-11eb-8881-30366627206b.png">
   <img width="661" alt="final-cleaner-2" src="https://user-images.githubusercontent.com/30863489/120654337-3ac89480-c49f-11eb-83ee-fb08e0627831.png">
   <img width="671" alt="final-cleaner-3" src="https://user-images.githubusercontent.com/30863489/120654397-487e1a00-c49f-11eb-96c9-b30f38730019.png">
   <img width="663" alt="final-cleaner-4" src="https://user-images.githubusercontent.com/30863489/120654515-68154280-c49f-11eb-8f57-e72e0e4c0199.png">
   <img width="659" alt="final-cleaner-5" src="https://user-images.githubusercontent.com/30863489/120654549-6fd4e700-c49f-11eb-8b45-fa91cf14a045.png">
   
   @nsivabalan Please take a look. :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r641926942



##########
File path: docs/_posts/2021-05-28-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,106 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Ensuring isolation between Hudi writers and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and keeping your data lake storage costs in check
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user is ingesting data into a hudi dataset of type COPY_ON_WRITE every 30 minutes as shown below:
+
+![Initial timeline](/assets/images/blog/hoodie-cleaner/Initial_timeline.png)
+_Figure1: Incoming records getting ingested into a hudi dataset every 30 minutes_
+
+The figure shows a particular partition on DFS where commits and corresponding file versions are color coded. 4 different file groups are created in this partition as depicted by fileId1, fileId2, fileId3 and fileId4. File group corresponding to fileId2 has records ingested from all the 5 commits, while the group corresponding to fileId4 has records from the latest 2 commits only.
+
+Suppose the user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=2
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 2 (configured) + 1 commits are determined. In Figure1, `commit 10:30` and `commit 10:00` correspond to the latest 2 commits in the timeline. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 1 hour to finish, and ingestion happens every 30 minutes, you need to retain last 2 commits since 2*30 = 60 (1 hour). At this point of time, the longest query can still be using files written in 3rd commit in reverse order. Essentially this means if a query started executing after `commit 9:30`, it will still be running when clean action is triggered right after `commit 10:30` as in Figure2. 
+ -  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 3rd commit (`commit 9:30` in figure below) in reverse order.
+
+![Retain latest commits](/assets/images/blog/hoodie-cleaner/Retain_latest_commits.png)
+_Figure2: Files corresponding to latest 3 commits are retained_
+
+Now, suppose he uses the below configs for cleaning:

Review comment:
       minor. replace "he" with user :) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r641319609



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.
+
+### Configurations
+
+You can find the details about all the possible configurations along with the default values [here](https://hudi.apache.org/docs/configurations.html#compaction-configs).
+
+### Run command

Review comment:
       done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-850210814


   @n3nash @nsivabalan Please take a look. All the comments are addressed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r641327528



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.
+
+### Configurations
+
+You can find the details about all the possible configurations along with the default values [here](https://hudi.apache.org/docs/configurations.html#compaction-configs).
+
+### Run command
+
+Hudi's cleaner table service can be run as a separate process or along with your data ingestion. As mentioned earlier, it basically cleans up any stale/old files lying around. In case you want to run it along with ingesting data, configs are available which enable you to run it in [parallel or in sync](https://hudi.apache.org/docs/configurations.html#withAsyncClean). You can use the below command for running the cleaner independently:

Review comment:
       Thank you for pointing this out as well. Added the commands.
   
   Also once we have cleans commands documented properly here - https://hudi.apache.org/docs/deployment.html#cli
   We can add a link to the page in this blog as well. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r644803395



##########
File path: docs/_posts/2021-05-28-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,106 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Ensuring isolation between Hudi writers and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and keeping your data lake storage costs in check
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user is ingesting data into a hudi dataset of type COPY_ON_WRITE every 30 minutes as shown below:
+
+![Initial timeline](/assets/images/blog/hoodie-cleaner/Initial_timeline.png)
+_Figure1: Incoming records getting ingested into a hudi dataset every 30 minutes_
+
+The figure shows a particular partition on DFS where commits and corresponding file versions are color coded. 4 different file groups are created in this partition as depicted by fileId1, fileId2, fileId3 and fileId4. File group corresponding to fileId2 has records ingested from all the 5 commits, while the group corresponding to fileId4 has records from the latest 2 commits only.
+
+Suppose the user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=2
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 2 (configured) + 1 commits are determined. In Figure1, `commit 10:30` and `commit 10:00` correspond to the latest 2 commits in the timeline. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 1 hour to finish, and ingestion happens every 30 minutes, you need to retain last 2 commits since 2*30 = 60 (1 hour). At this point of time, the longest query can still be using files written in 3rd commit in reverse order. Essentially this means if a query started executing after `commit 9:30`, it will still be running when clean action is triggered right after `commit 10:30` as in Figure2. 
+ -  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 3rd commit (`commit 9:30` in figure below) in reverse order.
+
+![Retain latest commits](/assets/images/blog/hoodie-cleaner/Retain_latest_commits.png)
+_Figure2: Files corresponding to latest 3 commits are retained_
+
+Now, suppose he uses the below configs for cleaning:

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-853878207


   <img width="663" alt="final-cleaner-1" src="https://user-images.githubusercontent.com/30863489/120653442-4c5d6c80-c49e-11eb-8881-30366627206b.png">
   <img width="661" alt="final-cleaner-2" src="https://user-images.githubusercontent.com/30863489/120654337-3ac89480-c49f-11eb-83ee-fb08e0627831.png">
   <img width="671" alt="final-cleaner-3" src="https://user-images.githubusercontent.com/30863489/120654397-487e1a00-c49f-11eb-96c9-b30f38730019.png">
   <img width="663" alt="final-cleaner-4" src="https://user-images.githubusercontent.com/30863489/120654515-68154280-c49f-11eb-8f57-e72e0e4c0199.png">
   <img width="659" alt="final-cleaner-5" src="https://user-images.githubusercontent.com/30863489/120654549-6fd4e700-c49f-11eb-8b45-fa91cf14a045.png">
   
   @nsivabalan Please take a look. :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-851627436


   cool, one minor suggestion in the images. Can you also label the min file slice commit time for each file group. its implicit by counting the no of file versions, but could be more explicit. 
   LGTM otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r645990277



##########
File path: docs/_posts/2021-06-03-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,106 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Ensuring isolation between Hudi writers and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and keeping your data lake storage costs in check
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user is ingesting data into a hudi dataset of type COPY_ON_WRITE every 30 minutes as shown below:
+
+![Initial timeline](/assets/images/blog/hoodie-cleaner/Initial_timeline.png)
+_Figure1: Incoming records getting ingested into a hudi dataset every 30 minutes_
+
+The figure shows a particular partition on DFS where commits and corresponding file versions are color coded. 4 different file groups are created in this partition as depicted by fileId1, fileId2, fileId3 and fileId4. File group corresponding to fileId2 has records ingested from all the 5 commits, while the group corresponding to fileId4 has records from the latest 2 commits only.

Review comment:
       @nsivabalan I have mentioned here that fileIds represent file groups. Does this solve what you are looking for?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-855224297


   sorry, one last comment. in the figure, instead of addressing as fielIds (fileId1, fileId2...), can we use fileGroup1 , fileGroup2 etc. A file group represents all files in that group. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r636314876



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"

Review comment:
       Change to "Ensuring isolation between Hudi writers and readers using HoodieCleaner"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r640897144



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.

Review comment:
       Thank you for pointing this out. Adding visual examples here. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r636315210



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 

Review comment:
       Remove this part "generating Hudi tables"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-850502987


   Can you build the site locally and take screenshot and attach it here. would be nice to review that as well. 
   for eg: https://github.com/apache/hudi/pull/2969


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r636321415



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, lets understand the  different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this file, file id remains the same and commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the  file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer ingesting data  into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N number of file versions irrespective of time. This policy is use-ful when it is known how many MAX versions of the file does one want to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (configured) + 1 commits are determined. One extra commit is included because the time window for retaining commits is essentially equal to the longest query run time. So if the longest query takes 5 hours to finish, and ingestion happens every 30 minutes, you need to retain last 10 commits since 10*30 = 300 (5 hours). At this point of time, the longest query can still be using files written in 11th commit in reverse order.  Now for any file group, only those file slices are scheduled for cleaning which are not savepointed (another Hudi table service) and whose commit time is less than the 11th commit in reverse order.
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
+hoodie.cleaner.fileversions.retained=2
+```
+
+Cleaner does the following:
+
+ - For any file group, last 2 versions (including any for pending compaction) of file slices are kept and the rest are scheduled for cleaning.
+
+### Configurations
+
+You can find the details about all the possible configurations along with the default values [here](https://hudi.apache.org/docs/configurations.html#compaction-configs).
+
+### Run command
+
+Hudi's cleaner table service can be run as a separate process or along with your data ingestion. As mentioned earlier, it basically cleans up any stale/old files lying around. In case you want to run it along with ingesting data, configs are available which enable you to run it in [parallel or in sync](https://hudi.apache.org/docs/configurations.html#withAsyncClean). You can use the below command for running the cleaner independently:

Review comment:
       [parallel or in sync] -> synchronously or asynchronously 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2967: Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r636315987



##########
File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##########
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth

Review comment:
       Rename to "Reclaiming space and keeping your data lake storage costs in check"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan merged pull request #2967: [HUDI-1766] Added blog for Hudi cleaner service

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #2967:
URL: https://github.com/apache/hudi/pull/2967


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org