Posted to commits@druid.apache.org by "abhishekrb19 (via GitHub)" <gi...@apache.org> on 2023/05/23 04:50:34 UTC

[GitHub] [druid] abhishekrb19 opened a new issue, #14330: New storage / retention policy API in Druid

abhishekrb19 opened a new issue, #14330:
URL: https://github.com/apache/druid/issues/14330

   ### Motivation
   
   While Druid's [rules](https://druid.apache.org/docs/latest/operations/rule-configuration.html) (load, drop, and broadcast rules) and [kill tasks](https://druid.apache.org/docs/latest/data-management/delete.html) are powerful, they can be complex to use and understand, especially in the context of retention. Druid users need to think about the lifecycle of segments (used/unused), map segments to tiered replicants, and add the appropriate imperative rules to the rule chain in the correct order.
   
   ### Proposed changes
   At a high level, users can define a storage policy for the hot tier (aka historical tier) and the deep storage. To that effect, introduce a storage policy API that translates user-defined policies to one or more load and drop rules under the hood.
   
   #### New API `/druid/coordinator/v1/storagePolicy/<dataSource>`
   
   The API will accept two parameters in the create payload:
   - `hot`: Defines how long to keep the data in the hot tier(s) (aka historical tiers)
   - `retain`: Defines how long to retain the data before it's cleaned up permanently, including data from the deep storage and metadata store
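
As a rough illustration of the create payload described above, the following Python sketch builds (but does not send) a `POST` request for the proposed endpoint. The Coordinator address, the data source name `wikipedia`, and the helper name `build_storage_policy_request` are assumptions made for this example; the endpoint itself is only a proposal and does not exist in Druid today.

```python
import json
import urllib.request

# Assumption: a Coordinator listening on its default port on localhost.
COORDINATOR_URL = "http://localhost:8081"


def build_storage_policy_request(data_source, hot_period=None, retain_period=None):
    """Build, without sending, a POST request for the proposed storagePolicy endpoint."""
    policy = {}
    if hot_period is not None:
        policy["hot"] = {"type": "period", "period": hot_period}
    if retain_period is not None:
        policy["retain"] = {"type": "period", "period": retain_period}
    # The path below is the one proposed in this issue, not an existing Druid API.
    url = f"{COORDINATOR_URL}/druid/coordinator/v1/storagePolicy/{data_source}"
    return urllib.request.Request(
        url,
        data=json.dumps(policy).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_storage_policy_request("wikipedia", hot_period="PT1H", retain_period="P30D")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Both parameters are optional in this sketch, so a policy with only `hot` or only `retain` produces a payload with a single key.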
   
   
   #### Translation of storage policy to load & drop rules
   A few use cases, along with the storage policy payloads and the corresponding internal load/drop rules, are shown below:
   
   |Intent | Storage Policy | Load/Drop Rule  |
   |------ | ------ | --------- |
   |Keep the most recent hour of data <br> in the hot tier and permanently delete <br> all data older than 30 days.|<pre>{<br>&nbsp;&nbsp;"hot": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "PT1H"<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;"retain": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D"<br>&nbsp;&nbsp;}<br>}</pre>|<pre>[<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "PT1H",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 1<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "dropBeforeByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D"<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadForever",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 0<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;}<br>]</pre>|
   | Drop all data older <br> than 30 days from the hot tier.| <pre>{<br>&nbsp;&nbsp;"hot": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D"<br>&nbsp;&nbsp;}<br>}</pre>  |<pre>[<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P30D",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 1<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadForever",<br>&nbsp;&nbsp;&nbsp;&nbsp;"tieredReplicants": {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"_default_tier": 0<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;}<br>]</pre>|
   | Delete all data older <br> than 60 days.| <pre>{<br>&nbsp;&nbsp;"retain": {<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "period",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P60D"<br>&nbsp;&nbsp;}<br>}</pre> |<pre>[<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "dropBeforeByPeriod",<br>&nbsp;&nbsp;&nbsp;&nbsp;"period": "P60D"<br>&nbsp;&nbsp;},<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;"type": "loadForever"<br>&nbsp;&nbsp;}<br>]</pre>|
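
The three rows above suggest a simple mapping for period-based policies, which could be sketched in Python roughly as follows. The function name `policy_to_rules` is hypothetical, and the sketch only reproduces the rule chains shown in the table; it assumes a single `_default_tier` and does not handle interval-based or tiered policies.

```python
DEFAULT_TIER = "_default_tier"


def policy_to_rules(policy):
    """Translate a period-based storage policy into a load/drop rule chain."""
    hot = policy.get("hot")
    retain = policy.get("retain")
    for name, part in (("hot", hot), ("retain", retain)):
        if part is not None and part.get("type") != "period":
            raise ValueError(f"only period-based '{name}' policies are handled in this sketch")

    rules = []
    if hot is not None:
        # Keep the most recent `hot` period loaded on the default tier.
        rules.append({
            "type": "loadByPeriod",
            "period": hot["period"],
            "tieredReplicants": {DEFAULT_TIER: 1},
        })
    if retain is not None:
        # Drop everything older than the `retain` period.
        rules.append({"type": "dropBeforeByPeriod", "period": retain["period"]})

    # Catch-all rule that closes the chain, as in every row of the table.
    catch_all = {"type": "loadForever"}
    if hot is not None:
        # Zero replicants, as in the table rows that define `hot`.
        catch_all["tieredReplicants"] = {DEFAULT_TIER: 0}
    rules.append(catch_all)
    return rules


print(policy_to_rules({"retain": {"type": "period", "period": "P60D"}}))
```

Rule order matters because rules are evaluated top to bottom, so the `loadByPeriod` rule comes first and the `loadForever` catch-all comes last.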
   
   
   #### Extensibility & Maintainability
   Similar to the above period-based policies, we can add interval-based and custom tiered-policies for more advanced users. For example:
   a. Interval-based policy:
   ```json
   {
     "hot": {
       "type": "intervals",
       "intervals": ["2020-01-01/2022-01-01", "2023-01-01/9999-01-01"]
     }
   }
   ```
   b. Custom-tiered policy:
   ```json
   {
     "hot": {
       "type": "tiered",
       "tiers": {
         "hot1": {"type": "period", "period": "P60D"},
         "hot2": {"type": "period", "period": "P90D"}
       }
     },
     "retain": {"type": "period", "period": "P1Y"}
   }
   ```
   The API will need to translate user-defined storage policies to rules as we extend support to cover more complex use cases.
   
   #### High-level implementation
   
   The API implementation will support `POST`, `GET` and `DELETE` operations to create, retrieve and delete any configured storage policy per data source. Similar to the `rules` endpoint, this new endpoint should be on the coordinator and should return appropriate error/status codes to the user. The implementation of the API will:
   - Validate ISO 8601 periods
   - Validate Interval strings
   - Check that `hot.period` is not larger than `retain.period`
   - Disallow `retain` if auto-kill configuration is disabled
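
A minimal Python sketch of these checks for period-based policies might look like the following. The names `period_to_seconds` and `validate_policy` are hypothetical, the `auto_kill_enabled` flag stands in for however the Coordinator would read its auto-kill configuration, and the comparison assumes a year is 365 days and a month is 30 days, which a real implementation would need to define more carefully. Interval validation is left out.

```python
import re

# ISO 8601 periods such as P30D, PT1H or P1Y2M; the week form (P2W) is matched separately.
_PERIOD_RE = re.compile(
    r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)
_WEEKS_RE = re.compile(r"^P(\d+)W$")


def period_to_seconds(period):
    """Return an approximate length in seconds, or raise ValueError if malformed.

    Assumption, for comparison only: a year is 365 days and a month is 30 days.
    """
    weeks = _WEEKS_RE.match(period)
    if weeks:
        return int(weeks.group(1)) * 7 * 86400
    match = _PERIOD_RE.match(period)
    # Reject strings with no components at all, such as "P" or "PT".
    if not match or not any(match.groups()):
        raise ValueError(f"invalid ISO 8601 period: {period!r}")
    years, months, days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return (((years * 365 + months * 30 + days) * 24 + hours) * 60 + minutes) * 60 + seconds


def validate_policy(policy, auto_kill_enabled):
    """Return a list of error messages; an empty list means the policy passed."""
    errors = []
    seconds = {}
    for name in ("hot", "retain"):
        part = policy.get(name)
        if part is None or part.get("type") != "period":
            continue  # interval and tiered policies are out of scope for this sketch
        try:
            seconds[name] = period_to_seconds(part.get("period", ""))
        except ValueError as exc:
            errors.append(f"{name}: {exc}")
    if "retain" in policy and not auto_kill_enabled:
        errors.append("retain: requires auto-kill to be enabled")
    if "hot" in seconds and "retain" in seconds and seconds["hot"] > seconds["retain"]:
        errors.append("hot.period must not be larger than retain.period")
    return errors


print(validate_policy(
    {"hot": {"type": "period", "period": "P60D"}, "retain": {"type": "period", "period": "P30D"}},
    auto_kill_enabled=True,
))
```

Returning a list of messages rather than raising on the first failure lets the endpoint report every problem in a single error response.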
   
   ### Rationale
   
   The main benefit of the API is that it abstracts away the complex inner workings of load, drop and kill rules. It provides a declarative interface for reasoning about retention, as many other systems do.
   
   ### Operational impact
   
   Since this API-only change leverages the existing load/drop rule functionality, nothing needs to be deprecated in short order. If it makes sense to deprecate the rules API at some point because the new API is equally powerful, then we may consider that. 
   
   ### Future work
   
   In environments with multiple hot tiers, users must manually enumerate the tiers in `tieredReplicants` if they use load rules. We can extend the storage policy API to automatically list all the tiers by default if they are not supplied.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #14330: New storage / retention policy API in Druid

Posted by "gianm (via GitHub)" <gi...@apache.org>.
gianm commented on issue #14330:
URL: https://github.com/apache/druid/issues/14330#issuecomment-1559830650

   Some initial thoughts:
   
   From the "operational impact" section, it sounds like there is no storage as part of this proposal, just syntax sugar on load/drop rules. Is this correct? If so:
   
   - what happens on `GET` if there are some load/drop rules for a datasource that can't be mapped onto a storage policy object? Or, is it possible for all combinations of load/drop rules to be mapped onto storage policy?
   - how are the cluster-wide default load/drop rules dealt with?
   
   Additionally, did you consider options where the storage policy _is_ a real object that is stored, perhaps in the new (& currently not-really-used) catalog? In that case the API would be through `CatalogResource`. Curious what you see as the pros and cons of these two approaches.




[GitHub] [druid] vogievetsky commented on issue #14330: New storage / retention policy API in Druid

Posted by "vogievetsky (via GitHub)" <gi...@apache.org>.
vogievetsky commented on issue #14330:
URL: https://github.com/apache/druid/issues/14330#issuecomment-1559895699

   Overall I really like the design.
   
   Some thoughts and questions:
   
   - Will there be an API to fetch all the storage policies together? The equivalent of `GET /druid/coordinator/v1/rules` that is needed for the console.
   - Like Gian, I want to know what will actually be stored in the DB. Will setting a storagePolicy actually just write some new load rules that I will be able to see if I query the load rules directly? If I set a storagePolicy and then write a load rule, will that effectively "update" the storage policy? I am 👍 for an approach where this is all just sugar on top of existing load rules and the storage policy is not a real object that is stored.
   - `Disallow retain if auto-kill configuration is disabled`: I have issues with that. What if you set `retain` and then disable auto-kill? How will the UI know whether auto-kill is enabled (so as to know whether to render the `retain` controls)?
   - What is going on here: 
     <img width="765" alt="image" src="https://github.com/apache/druid/assets/177816/723dbcc5-9d69-4146-ad12-583b864c7ceb">
     What happens if `hot` is not set: is everything hot, or is nothing hot? I think the load rule part of the example suggests that everything is hot.
   - What would the storage policy for the current default, where everything is hot, look like?

