Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/11/24 16:20:33 UTC

[GitHub] [druid] yuanlihan opened a new pull request #11989: Add segment merged result cache for broker

yuanlihan opened a new pull request #11989:
URL: https://github.com/apache/druid/pull/11989


   
   ### Description
   This PR adds a new cache feature to the Broker.
   |Property|Possible Values|Description|Default|
   |--------|---------------|-----------|-------|
   |`druid.broker.cache.useSegmentMergedResultCache`|true, false|Enable segment merged result cache on the Broker.|false|
   |`druid.broker.cache.populateSegmentMergedResultCache`|true, false|Populate segment merged result cache on the Broker.|false|
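   
   A minimal sketch of a Broker `runtime.properties` snippet enabling the new cache (this assumes the Broker's local cache backend, e.g. `druid.cache.type`, is configured separately; the property names are the two from the table above):
   
   ```properties
   # Hypothetical example: enable reading from and populating the segment merged
   # result cache on the Broker. The cache backend itself (druid.cache.type,
   # sizing, etc.) is assumed to be configured elsewhere.
   druid.broker.cache.useSegmentMergedResultCache=true
   druid.broker.cache.populateSegmentMergedResultCache=true
   ```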
   
   This new cache feature caches the merged result of historical segments on the Broker. In some use cases it is expected to be a better cache option for the Broker than the existing segment-level and result-level caches, as summarized below.
   |                              | Case 1: works with the groupBy v2 engine | Case 2: works when scanning only historical segments | Case 3: works when scanning both historical and realtime segments | Case 4: efficient when caching a large number of segments |
   | ---------------------------- | ----------------------------------------- | ------------------------------------------------------ | -------------------------------------------------------------------- | ------------------------------------------------------------ |
   | Segment-level cache          | &cross;                                   | &check;                                                | &check;                                                              | &cross;                                                      |
   | Result-level cache           | &cross;                                   | &check;                                                | &cross;                                                              | &check;                                                      |
   | Segment merged result cache  | &check;                                   | &check;                                                | &check;                                                              | &check;                                                      |
   
   
   I will try to write benchmarks comparing these cache options and may post the benchmark results here later.
   
   ##### Key changed/added classes in this PR
    * `CachingClusteredClient`
    * `CacheUtil`
    * `ForegroundCachePopulator`
    * `BackgroundCachePopulator`
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   



[GitHub] [druid] yuanlihan commented on pull request #11989: Add segment merged result cache for broker

Posted by GitBox <gi...@apache.org>.
yuanlihan commented on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-984247964


   > @yuanlihan thanks for the explanation. Your last comment is what I was looking for 🙂 I haven't read the code yet, so perhaps my questions are very basic. Based on your comment, I suppose the problem you want to solve is that the current result-level cache may not be very useful if your query involves realtime segments? And the new post-merge cache can address this by caching only the merged results from the historical segments? This sounds reasonable to me. It would be nice if you could add a summarized version of the answers to these questions to the PR description. It will help others understand what problem you are solving.
   
   Hi @jihoonson, thanks for looking into this.
   
   I have updated the PR description for design review; your feedback would be appreciated.
   
   



[GitHub] [druid] paul-rogers edited a comment on pull request #11989: Add segment merged result cache for broker

Posted by GitBox <gi...@apache.org>.
paul-rogers edited a comment on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-982054041


   @yuanlihan, cool feature! Here is one question you might want to answer: what is the use case that will be improved by this change?
   
   The feature would seem to optimize queries that hit exactly the same data every time. How common is this? It might solve a use case where you have, say, 100 dashboards, each of which issues the same query: one of them runs the real query, and the other 99 fetch from the cache.
   
   Here's a similar use case I've seen many times, but would *not* benefit from a post-merge cache: a dashboard wants to show the "last 10 minutes" of data, and we have minute-grain cubes (segments). Every query shifts the time range by one minute which will add a bit more new data, and exclude a bit of old data. With this design, we cache the post-merge data, so each shift in the time window will result in a new result set to cache.
   
   Can this feature instead cache pre-merge results? That way, we cache each minute slice once. The merge is still needed to gather up the slices required by the query. But, that merge is in-memory and should be pretty fast. The result would be, in this example, caching 1/10 the amount of data compared to caching post-merge data. And, less load on the historicals since we don't hit them each time the query time range shifts forward one minute.
   
   In short, please explain your use case a bit more so we can understand the goal of this enhancement.
   
   Edit: I'm told we may have some of the above. Historicals can already cache per-segment results. In a system with large numbers of segments, and where the query hits a good percentage of those, I'm told the merge cost can be high. Is it the merge cost that this PR seeks to avoid?



[GitHub] [druid] jihoonson commented on pull request #11989: Add segment merged result cache for broker

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-978118737


   Hi @yuanlihan, can you add some more details of the new feature, such as what it does, how it works, and how it is different from other caches? Ideally, this all should be documented in user docs. I think we can add docs as a follow-up of this PR.



[GitHub] [druid] jihoonson commented on pull request #11989: Add segment merged result cache for broker

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-983963886


   @yuanlihan thanks for the explanation. Your last comment is what I was looking for 🙂 I haven't read the code yet, so perhaps my questions are very basic. Based on your comment, I suppose the problem you want to solve is that the current result-level cache may not be very useful if your query involves realtime segments? And the new post-merge cache can address this by caching only the merged results from the historical segments? This sounds reasonable to me. It would be nice if you could add a summarized version of the answers to these questions to the PR description. It will help others understand what problem you are solving.
   
   > Can this feature instead cache pre-merge results? That way, we cache each minute slice once. The merge is still needed to gather up the slices required by the query. But, that merge is in-memory and should be pretty fast. The result would be, in this example, caching 1/10 the amount of data compared to caching post-merge data. And, less load on the historicals since we don't hit them each time the query time range shifts forward one minute.
   
   This sounds like the segment-level cache on broker to me too. We already support it (https://druid.apache.org/docs/0.22.0/querying/using-caching.html#enabling-query-caching-on-brokers).
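   
   For reference, a minimal sketch of enabling that existing segment-level cache on the Broker (property names from the linked doc; the cache backend is assumed to be configured separately):
   
   ```properties
   # Existing Broker segment-level cache settings (not part of this PR).
   druid.broker.cache.useCache=true
   druid.broker.cache.populateCache=true
   ```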



[GitHub] [druid] yuanlihan commented on pull request #11989: Add segment merged result cache for broker

Posted by GitBox <gi...@apache.org>.
yuanlihan commented on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-978676557


   > Hi @yuanlihan, can you add some more details of the new feature, such as what it does, how it works, and how it is different from other caches? Ideally, this all should be documented in user docs. I think we can add docs as a follow-up of this PR.
   
   Sure, I will document it later.



[GitHub] [druid] jihoonson edited a comment on pull request #11989: Add segment merged result cache for broker

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-980837329


   > It's expected as a better cache option for Broker in some use cases.
   
   @yuanlihan perhaps I should have been clearer. More details about this new feature will clarify what problem you want to solve and help reviewers understand your intention better. For example, in the statement quoted above, "better" is quite ambiguous; you can explain in what ways the new cache is better than the others. You can add the user-facing docs later.


[GitHub] [druid] yuanlihan edited a comment on pull request #11989: Add segment merged result cache for broker

Posted by GitBox <gi...@apache.org>.
yuanlihan edited a comment on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-983815418


   Hi @paul-rogers, thanks for the comment.
   
   >what is that use case that will be improved by this change?
   
   This cache option can be useful for queries that need to scan both historical and realtime segments and that use a time range filter such as the day so far, or this week/month/year so far.
   
   >The feature would seem to optimize queries that hit exactly the same data every time. How common is this? It might solve a use case that you have, say, 100 dashboards, each of which will issue the same query. One of them runs the real query, 99 fetch from the cache.
   
   Let's say there's a dashboard with a today-so-far time range filter. It issues the same query every minute to fetch realtime metrics. The cache feature would work as follows (a rough code sketch follows the list):
   
   1. At first, there are 10 historical segments and 2 realtime segments that need to be scanned. The cache feature populates the merged result of the 10 historical segments into the cache.
   2. One minute later, as you described, the next query fetches the cached merged result of the 10 historical segments and merges it with the result of querying the realtime segments.
   3. A few moments later, let's assume 2 new segments are served by Historicals, which means we now need to scan 12 historical segments and 2 realtime segments. The Broker prunes the query and issues it only for the 2 new historical segments and the 2 realtime segments. It then merges the previously cached result with the result of the 2 new historical segments, populates the merged result of the 12 historical segments into the cache, and finally returns the merged result of the historical and realtime segments.
   4. We can reuse the cached merged result as long as it covers a **_subset_** of the historical segments that need to be scanned.
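   
   Roughly, in Java-like pseudocode (a hypothetical, self-contained sketch with invented names; not the actual code in `CachingClusteredClient` or the cache populators):
   
   ```java
   import java.util.Collection;
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;
   
   // Hypothetical illustration of the segment merged result cache flow described above.
   // Each segment contributes a numeric partial result, and "merging" is just addition.
   class SegmentMergedResultCacheSketch
   {
     // A cached entry records which historical segments it covers and their merged result.
     record CachedEntry(Set<String> segmentIds, long mergedResult) {}
   
     private final Map<String, CachedEntry> cache = new HashMap<>();
   
     // Stand-in for actually querying a set of segments.
     private long scan(Collection<String> segmentIds)
     {
       return segmentIds.stream().mapToLong(id -> id.hashCode() & 0xff).sum();
     }
   
     long runQuery(String cacheKey, Set<String> historicalSegments, Set<String> realtimeSegments)
     {
       CachedEntry cached = cache.get(cacheKey);
       Set<String> uncovered = new HashSet<>(historicalSegments);
       long mergedHistorical = 0;
       if (cached != null && historicalSegments.containsAll(cached.segmentIds())) {
         // Reuse the cached merged result only when it covers a subset of the segments to scan.
         uncovered.removeAll(cached.segmentIds());
         mergedHistorical = cached.mergedResult();
       }
       mergedHistorical += scan(uncovered);  // query only the uncovered historical segments
       // Re-populate the cache so it now covers all historical segments of this query.
       cache.put(cacheKey, new CachedEntry(Set.copyOf(historicalSegments), mergedHistorical));
       // Realtime results are merged into the response but never cached.
       return mergedHistorical + scan(realtimeSegments);
     }
   }
   ```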
   
   Is it common? Hmm, in any case it would be good to have an efficient cache option for this use case.
   
   >Here's a similar use case I've seen many times, but would not benefit from a post-merge cache: a dashboard wants to show the "last 10 minutes" of data, and we have minute-grain cubes (segments). Every query shifts the time range by one minute which will add a bit more new data, and exclude a bit of old data. With this design, we cache the post-merge data, so each shift in the time window will result in a new result set to cache.
   
   Minute-grain segments sound too fine-grained for Druid. In my experience, a query that shows the "last 10 minutes" of data should already be fast, since the last 10 minutes of data are usually still held by Peon workers.
   This new cache feature should also benefit fixed queries with a relative time filter, such as the last several hours or the last day/week. Let's assume the query granularity is `PT5M` for a last-one-hour filter, `PT15M` for a last-several-hours filter, and so on. If the client is willing to align the start of the time interval filter to the query granularity, then every 5 minutes (or every 15 minutes) the fixed query effectively becomes a query with a fixed timestamp-so-far filter, which can potentially leverage the new cache feature.
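   
   For instance, a minimal illustration (plain Java time arithmetic, not a Druid API) of flooring an interval start to a `PT5M` query granularity so that repeated relative-time queries issue identical, cache-friendly intervals:
   
   ```java
   import java.time.Duration;
   import java.time.Instant;
   
   // Hypothetical helper: align (floor) an interval start to the query granularity.
   class IntervalAlignment
   {
     static Instant alignStart(Instant start, Duration granularity)
     {
       long g = granularity.toMillis();
       return Instant.ofEpochMilli((start.toEpochMilli() / g) * g);
     }
   
     public static void main(String[] args)
     {
       Instant now = Instant.parse("2021-11-24T16:23:45Z");
       // With PT5M granularity, every "last one hour" query issued between 16:20 and 16:25
       // uses the same aligned start, so its historical part can hit the same cache entry.
       System.out.println(alignStart(now.minus(Duration.ofHours(1)), Duration.ofMinutes(5)));
       // Prints 2021-11-24T15:20:00Z
     }
   }
   ```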
   
   >Can this feature instead cache pre-merge results? That way, we cache each minute slice once. The merge is still needed to gather up the slices required by the query. But, that merge is in-memory and should be pretty fast. The result would be, in this example, caching 1/10 the amount of data compared to caching post-merge data. And, less load on the historicals since we don't hit them each time the query time range shifts forward one minute.
   
   As far as I have observed, fetching a large number of entries from the cache is not very efficient, especially when many of them are large. The pre-merge result caching you describe sounds like the segment-level cache on the Broker. However, if the segment granularity is much coarser than the query granularity, say 1-hour segment granularity with 1-minute query granularity, then segment-level caching will not work well. So I see your point, but I'm not sure whether a query-granularity-level slice caching feature would be efficient. It's a nice point to discuss.
   
   

