You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/13 16:15:43 UTC

[GitHub] [druid] 599166320 opened a new issue, #13080: PeriodLoadRule cannot Remove expired segment

599166320 opened a new issue, #13080:
URL: https://github.com/apache/druid/issues/13080

   
   Recently, when deploying the cold/hot layered Druid cluster, it was found that a hot node loaded data beyond the time range, resulting in the hot node's storage being full soon. I found the same problem on the [Druid forum page](https://www.druidforum.org/t/druid-load-drop-rule/7739), which has not been handled by anyone for a long time. I checked `RunRules.java`, I feel there is a problem. `Periodloadrule` will not delete expired data at all, but only delete too many replicants. Does the current implementation of `PeriodLoadRule` meet expectations?
   
   The following is the current implementation of druid:
   ```
   //RunRules.run
         for (Rule rule : rules) {
           if (rule.appliesTo(segment, now)) {
             if (
                 stats.getGlobalStat(
                     "totalNonPrimaryReplicantsLoaded") >= paramsWithReplicationManager.getCoordinatorDynamicConfig()
                                                                                      .getMaxNonPrimaryReplicantsToLoad()
                 && !paramsWithReplicationManager.getReplicationManager().isLoadPrimaryReplicantsOnly()
             ) {
               log.info(
                   "Maximum number of non-primary replicants [%d] have been loaded for the current RunRules execution. Only loading primary replicants from here on for this coordinator run cycle.",
                   paramsWithReplicationManager.getCoordinatorDynamicConfig().getMaxNonPrimaryReplicantsToLoad()
               );
               paramsWithReplicationManager.getReplicationManager().setLoadPrimaryReplicantsOnly(true);
             }
             stats.accumulate(rule.run(coordinator, paramsWithReplicationManager, segment));
             foundMatchingRule = true;
             break;
           }
         }
   
   ```
   
   
   Now, I have solved this problem by adding `dropallExpireSegments` to `PeriodLoadRule.java`, but I don't know what bad effect it will have.
   
   Here is my implementation:
   
   ```
   //RunRules.run
         for (Rule rule : rules) {
           if (rule.appliesTo(segment, now)) {
             if (
                 stats.getGlobalStat(
                     "totalNonPrimaryReplicantsLoaded") >= paramsWithReplicationManager.getCoordinatorDynamicConfig()
                                                                                      .getMaxNonPrimaryReplicantsToLoad()
                 && !paramsWithReplicationManager.getReplicationManager().isLoadPrimaryReplicantsOnly()
             ) {
               log.info(
                   "Maximum number of non-primary replicants [%d] have been loaded for the current RunRules execution. Only loading primary replicants from here on for this coordinator run cycle.",
                   paramsWithReplicationManager.getCoordinatorDynamicConfig().getMaxNonPrimaryReplicantsToLoad()
               );
               paramsWithReplicationManager.getReplicationManager().setLoadPrimaryReplicantsOnly(true);
             }
             stats.accumulate(rule.run(coordinator, paramsWithReplicationManager, segment));
             foundMatchingRule = true;
             break;
           }else{
             //Add Delete Logic
             rule.dropAllExpireSegments(paramsWithReplicationManager,segment);
           }
         }
   ```
   
   ### Affected Version
   0.22.0
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1248827676

   I believe @AmatyaAvadhanula and @kfaraz were looking into this area of the code too, working to improve the Coordinator balancing behavior. Perhaps they will have some thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250802796

   @599166320 , taking a look at the rules now. 
   Could you also elaborate the fix that you made to the mentioned piece of code?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
599166320 commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250787452

   @kfaraz Have you checked `retention rules`? Will `retention rules` be improved? Or is there no problem now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
599166320 commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250098863

   @kfaraz I found that default tiers store data for 7 days. Is there a problem with my configuration?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250811970

   @599166320 , the configuration seems to be correct for your required behaviour.
   
   It seems that your `cold` tier needs to have 2 replicas. Do you have enough historicals in the `cold` tier?
   If you don't, then it is possible that the segments are never able to reach the target replication of 2.
   Currently, not being at the target replication level can also block the cleanup from the `_default_tier`.
   
   If you have a metric emitter configured, you can check these metrics emitted by the coordinator to
   be certain:
   `segment/underReplicated/count`
   `segment/dropped/count`
   
   Alternatively, you can also take a look at the logs at try to find logs which say something like:
   ```
   "Loading in progress for segment [<segmentIdhere>], skipping drop from tier [<tierNameHere>] until loading is complete! "
   ``` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
599166320 commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250823125

   @kfaraz The cold tier of our cluster has enough historicals. The current problem is that we don't want to  _default_tier's storage is full too fast.
   
   I will create a PR and let you review it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250095588

   @FrankChen021 , maybe there is some middle ground here?
   The default rule should be `loadForever` to allow things to function properly right out the box, without having to configure anything.
   But when a user explicitly modifies the rules (for a datasource or global), it's only natural that the last rule be a `forever` rule, either load or drop. If the user has already given it as so, good, otherwise, we tack on a `dropForever` rule at the end ourselves. (The user wouldn't mind us doing this as they were particular about loading data only upto a certain period/interval.)
   
   Translated to the code, this simply means that if no rule matches, we drop the segment. This is because no rule matching implies that the retention rules have been explicitly modified by the user, otherwise the default `loadForever` would have matched. 
   
   We already have an alert for segments that don't match any rule but the behaviour should still be reconsidered because it is indeed a little counterintuitive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250095450

   @599166320 , for your case, what you need may look like this:
   ```
   loadByPeriod P3D _defaultTier=1 replica, cold=1 replica
   loadByPeriod P7D cold=1 replica
   dropForever
   ```
   1. Data not more than 3 days old lives on both default and cold tiers
   2. Data more than 3 days old and upto 7 days old lives only on cold tier
   3. older data is dropped
   The only thing I added was the cold tier replicas in the first rule. If you don't have that, cold tier will not have data of the past 3 days. Is this what you intended?
   
   @FrankChen021 , maybe there is some middle ground here?
   The default rule should be `loadForever` to allow things to function properly right out the box, without having to configure anything.
   But when a user explicitly modifies the rules (for a datasource or global), it's only natural that the last rule be a `forever` rule, either load or drop. If the user has already given it as so, good, otherwise, we tack on a `dropForever` rule at the end ourselves. (The user wouldn't mind us doing this as they were particular about loading data only upto a certain period/interval.)
   
   Translated to the code, this simply means that if no rule matches, we drop the segment. This is because no rule matching implies that the retention rules have been explicitly modified by the user, otherwise the default `loadForever` would have matched. 
   
   We already have an alert for segments that don't match any rule but the behaviour should still be reconsidered because it is indeed a little counterintuitive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1248845069

   @599166320 , drop is handled by `DropRule`s like a `ForeverDropRule`, none of the `LoadRule`s are supposed to have that capability. You typically specify a bunch of rules for each datasource. The coordinator tries to find the first rule which applies to a given segment at a given time and tries to do what that matched rule suggests. If at any point in the lifetime of a segment, it matches with a `DropRule`, it gets dropped.
   
   So, I think the problem you are facing can be solved by simply having a `ForeverDropRule` at the end of your retention rule list (default or datasource-specific). Please let us know if this works for you.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] FrankChen021 commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
FrankChen021 commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1249111218

   I think @kfaraz 's answer makes sense.
   
   Actually, our [doc](https://druid.apache.org/docs/latest/operations/rule-configuration.html#load-rules) has pointed out this:
   
   > Load rules indicate how many replicas of a segment should exist in a server tier. Please note: If a Load rule is used to retain only data from a certain interval or period, it must be accompanied by a Drop rule. If a Drop rule is not included, data not within the specified interval or period will be retained by the default rule (loadForever).
   
   Although we have doc to state this, I think current default rule(loadForever) is little bit counterintuitive. 
   Maybe the default rule should be `dropForever`? But because Druid provides flexible retention rules(both load and drop), chaning the default rule to `dropForever` might not be so easy. 
   
   In constrast to Druid, some other DBs, only provides load rules(not the same concept, but similiar to Druid's load rule), so it makes sense that its default behaviour is a `dropForever` rule.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
599166320 commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250101211

   > @599166320 , could you please share the json of your rules for this datasource?
   
   You can see it in the[ Druid forum](https://www.druidforum.org/t/druid-load-drop-rule/7739).
   
   In addition, I found a problem with the following code, and I have repaired it. I ran normally for 2 days
   
   
   ```
   	  //RunRules.run
         for (Rule rule : rules) {
           if (rule.appliesTo(segment, now)) {
             //
             if (
                 stats.getGlobalStat(
                     "totalNonPrimaryReplicantsLoaded") >= paramsWithReplicationManager.getCoordinatorDynamicConfig()
                                                                                      .getMaxNonPrimaryReplicantsToLoad()
                 && !paramsWithReplicationManager.getReplicationManager().isLoadPrimaryReplicantsOnly()
             ) {
               log.info(
                   "Maximum number of non-primary replicants [%d] have been loaded for the current RunRules execution. Only loading primary replicants from here on for this coordinator run cycle.",
                   paramsWithReplicationManager.getCoordinatorDynamicConfig().getMaxNonPrimaryReplicantsToLoad()
               );
               paramsWithReplicationManager.getReplicationManager().setLoadPrimaryReplicantsOnly(true);
             }
             stats.accumulate(rule.run(coordinator, paramsWithReplicationManager, segment));
             foundMatchingRule = true;
             break;
           }
         }
   
   
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
599166320 commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1248999401

   thank  @kfaraz   for your reply. If I want the _defaultTier node to keep the hot data of the last 3 days, and the cold node to keep the cold data of the last 7 days, how should my retention rules be configured?
   My retention rules are configured as follows:
   ```
   loadByPeriod P3D _defaultTier
   loadByPeriod P7D cold
   dropForever
   ```
   
   Will this meet my needs?
   According to the current implementation of RunRules, the _defaultTier node will retain data for 7 days, and retain data for 4 more days. I think it is unreasonable, what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
kfaraz commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1250100109

   @599166320 , could you please share the json of your rules for this datasource?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] 599166320 commented on issue #13080: PeriodLoadRule cannot Remove expired segment

Posted by GitBox <gi...@apache.org>.
599166320 commented on issue #13080:
URL: https://github.com/apache/druid/issues/13080#issuecomment-1295842146

   Today, another cluster of ours (without any modification) once again experienced the problem that the expired data of the hot node was not deleted, causing the storage space to be quickly used up.
   
   
   ![s19ZIAf4Q2](https://user-images.githubusercontent.com/3204398/198834435-00348f2c-60b8-4bc6-a473-eb421e385783.jpg)
   
   ![img_v2_5824d26b-6812-4392-a17e-14021172683h](https://user-images.githubusercontent.com/3204398/198834451-50428cf6-edb7-4aa5-b907-e0fd7aea4e56.jpg)
   
   It looks like the storage space of the hot node is almost full, and the cold node still has enough storage space.
   
   
   ![image](https://user-images.githubusercontent.com/3204398/198834605-3a16aa11-090e-485a-ac9f-1f2742d3142f.png)
   
   ![image](https://user-images.githubusercontent.com/3204398/198834623-c499a12c-7fa0-4aed-9707-729b7ff2ef1f.png)
   
   Here are some monitoring:
   
   ![image](https://user-images.githubusercontent.com/3204398/198834682-d5e97ef9-c8c3-4eac-9eb1-00b452fb98b9.png)
   
   ![image](https://user-images.githubusercontent.com/3204398/198834721-5042203c-9751-48d0-bd49-cbbc284dfaf4.png)
   
   
   
   After this problem occurred, I quickly emitted the monitoring data to promethues. After observing for two hours, I upgraded the master node and applied `LoadDropByPeriod`, and the cluster quickly returned to normal. Of course, our cluster also found some problems in the process of applying `LoadDropByPeriod`, such as the need to limit the speed of deletion.
   
   
   I still have a few hours of monitoring data on my side that I can share with you if you need it. @kfaraz 
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org