Posted to commits@hudi.apache.org by "nsivabalan (via GitHub)" <gi...@apache.org> on 2023/03/31 18:08:18 UTC

[GitHub] [hudi] nsivabalan commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management

nsivabalan commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1154738916


##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew,  it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
+    2. **KEEP_BY_COUNT**. Keep N sub-partitions for a  high-level partition. When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration.

Review Comment:
   Can you help us understand the use case here? I am trying to understand what sub-partitions mean in this context. In Hudi, we have only one partitioning scheme, but it could be multi-level. So I am trying to see whether we can keep this high level.



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew,  it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
+    2. **KEEP_BY_COUNT**. Keep N sub-partitions for a  high-level partition. When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration.
+    3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration.
+2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time.

Review Comment:
   We should be able to achieve this by supporting regex.
   For example:
   Map<{PartitionRegex/Static Partitions} -> {TTL policy}>
   This map can have multiple entries as well.
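
A minimal sketch of the regex-to-policy map idea above, under stated assumptions: `PartitionTtlRules`, `Rule`, `PolicyType`, `add`, and `resolve` are hypothetical names for illustration and are not Hudi APIs. Rules are kept in an ordered list rather than a literal `Map`, so that a specific rule can be checked before a catch-all one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch: these class and method names are illustrative, not Hudi APIs.
class PartitionTtlRules {
  enum PolicyType { KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT }

  // One entry of the "partition regex -> TTL policy" mapping.
  static final class Rule {
    final Pattern partitionPattern;
    final PolicyType type;
    final long value;

    Rule(Pattern partitionPattern, PolicyType type, long value) {
      this.partitionPattern = partitionPattern;
      this.type = type;
      this.value = value;
    }
  }

  // Rules are checked in insertion order, so add specific rules before catch-all ones.
  private final List<Rule> rules = new ArrayList<>();

  PartitionTtlRules add(String partitionRegex, PolicyType type, long value) {
    rules.add(new Rule(Pattern.compile(partitionRegex), type, value));
    return this;
  }

  // Returns the first rule whose regex matches the whole partition path, if any.
  Optional<Rule> resolve(String partitionPath) {
    for (Rule rule : rules) {
      if (rule.partitionPattern.matcher(partitionPath).matches()) {
        return Optional.of(rule);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    PartitionTtlRules rules = new PartitionTtlRules()
        .add("user_id=1/.*", PolicyType.KEEP_BY_SIZE, 100L * 1024 * 1024 * 1024) // 100 GB
        .add("user_id=2/.*", PolicyType.KEEP_BY_TIME, 10)                        // 10 days
        .add("user_id=[^/]+/.*", PolicyType.KEEP_BY_TIME, 30);                   // default

    Rule resolved = rules.resolve("user_id=7/ts=2023-03-01").get();
    System.out.println(resolved.type + " " + resolved.value);
  }
}
```

With this shape, a static partition is just a regex without wildcards, and the default policy is the last, broadest entry.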



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew,  it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
+    2. **KEEP_BY_COUNT**. Keep N sub-partitions for a  high-level partition. When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration.

Review Comment:
   I feel both (2) and (3) are very much catered towards multi-field partitioning, such as ProductId/datestr-based partitioning. Can we lay out high-level strategies for single-level partitioning as well, in addition to multi-field partitioning?



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew,  it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
+    2. **KEEP_BY_COUNT**. Keep N sub-partitions for a  high-level partition. When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration.
+    3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration.
+2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time.
+3. It's possible that there are a lot of high-level partitions in the user's table,  and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add some explicit policies for some of them if needed. Explicit policies will override the default policy.
+
+So here we have the TTL policy definition:
+```java
+public class HoodiePartitionTTLPolicy {
+  public enum TTLPolicy {
+    KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT
+  }
+
+  // Partition spec for which the policy takes effect
+  private String partitionSpec;
+
+  private TTLPolicy policy;
+
+  private long policyValue;
+}
+```
+
+### User Interface for TTL policy
+Users can config partition TTL management policies through SparkSQL Call Command and through table config directly.  Assume that the user has a hudi table partitioned by two fields(user_id, ts), he can config partition TTL policies as follows.
+
+```sql
+-- Set default policy for all user_id, which keeps the data for 30 days.
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=*/', policy => 'KEEP_BY_TIME', policyValue => '30');
+ 
+--For partition user_id=1/, keep 10 sub partitions.
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=1/', policy => 'KEEP_BY_COUNT', policyValue => '10');
+
+--For partition user_id=2/, keep 100GB data in total
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=2/', policy => 'KEEP_BY_SIZE', policyValue => '107374182400');
+
+--For partition user_id=3/, keep the data for 7 day.
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=3/', policy => 'KEEP_BY_TIME', policyValue => '7');
+
+-- Show all the TTL policies including default and explicit policies
+call show_ttl_policies(table => 'test');
+user_id=*/	KEEP_BY_TIME	30
+user_id=1/	KEEP_BY_COUNT	10
+user_id=2/	KEEP_BY_SIZE	107374182400
+user_id=3/	KEEP_BY_TIME	7
+```
+
+### Storage for TTL policy
+The partition TTL policies will be stored in `hoodie.properties`since it is part of table metadata. The policy configs in `hoodie.properties`are defined as follows. Explicit policies are defined using a JSON array while default policy is defined using separate configs.

Review Comment:
   We should avoid using hoodie.properties for storing write configs. For example, we don't store cleaning or compaction scheduling/execution strategies in hoodie.properties.
   Users can start with 100GB as the TTL policy and later change it to 50GB, for instance. So these are strictly write configs in my opinion.



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew,  it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.

Review Comment:
   What is the last modified time? Does it refer only to new inserts, or to updates as well?



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew,  it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
+    2. **KEEP_BY_COUNT**. Keep N sub-partitions for a  high-level partition. When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration.

Review Comment:
   Is it possible to simplify the strategies so that they work for both single-field and multi-field partitioning? For example:
   TTL any partition whose last modified time (the last time data was added or updated) is more than 1 month old. This would work for both single-field partitioning (datestr) and multi-field partitioning (productId/datestr).
   Open to ideas.
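
A rough sketch of the last-modified-time rule, under stated assumptions: `LastModifiedTtl` and `expiredPartitions` are hypothetical names, and the map of partition path to last modified instant is assumed to be available from some statistics source that the RFC would have to define. The partition path is treated as an opaque string, which is why the same check covers single-field and multi-field layouts.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not a Hudi API. The last-modified map is an assumed input.
class LastModifiedTtl {
  // Returns, in sorted order, every partition whose last modification is older than `ttl`.
  static List<String> expiredPartitions(
      Map<String, Instant> lastModifiedByPartition, Instant now, Duration ttl) {
    Instant cutoff = now.minus(ttl);
    List<String> expired = new ArrayList<>();
    for (Map.Entry<String, Instant> entry : lastModifiedByPartition.entrySet()) {
      if (entry.getValue().isBefore(cutoff)) {
        expired.add(entry.getKey());
      }
    }
    Collections.sort(expired);
    return expired;
  }

  public static void main(String[] args) {
    Map<String, Instant> lastModified = new HashMap<>();
    // Single-field layout (datestr) and multi-field layout (productId/datestr) side by side.
    lastModified.put("datestr=2023-01-15", Instant.parse("2023-01-16T00:00:00Z"));
    lastModified.put("productId=1/datestr=2023-03-20", Instant.parse("2023-03-21T00:00:00Z"));
    lastModified.put("productId=2/datestr=2023-02-01", Instant.parse("2023-02-02T00:00:00Z"));

    Instant now = Instant.parse("2023-03-31T00:00:00Z");
    System.out.println(expiredPartitions(lastModified, now, Duration.ofDays(30)));
  }
}
```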



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew,  it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
+    2. **KEEP_BY_COUNT**. Keep N sub-partitions for a  high-level partition. When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration.

Review Comment:
   We should also call out that sub-partition-based policies might work only for day-based or time-based sub-partitioning, right? For example, say the partitioning is datestr/productId. How do we know, out of 1000 productIds under a given day, which 100 are older or newer (assuming all 1000 were created in the same commit)?
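
To make the concern concrete, here is a small sketch of KEEP_BY_COUNT as the RFC text describes it (delete the smallest partition values first). `KeepByCount` and `toDelete` are hypothetical names, and plain string ordering is an assumption about what "smaller partition values" means.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: not a Hudi API. Ordering is plain string comparison (an assumption).
class KeepByCount {
  // Returns the sub-partitions to delete so that at most `keep` remain,
  // dropping the smallest partition values first.
  static List<String> toDelete(List<String> subPartitions, int keep) {
    List<String> sorted = new ArrayList<>(subPartitions);
    Collections.sort(sorted);
    int excess = Math.max(0, sorted.size() - keep);
    return new ArrayList<>(sorted.subList(0, excess));
  }

  public static void main(String[] args) {
    // Date-valued sub-partitions: the smallest value is also the oldest data.
    System.out.println(toDelete(Arrays.asList("2023-03-03", "2023-03-01", "2023-03-02"), 2));
    // productId-valued sub-partitions: the smallest value says nothing about age.
    System.out.println(toDelete(Arrays.asList("p9", "p10", "p2"), 2));
  }
}
```

For date-valued sub-partitions the deleted entry is the oldest one, but for productId values the choice is arbitrary with respect to age, which is the case the RFC should call out.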


