Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/12/01 16:35:38 UTC

[GitHub] [druid] yuanlihan edited a comment on pull request #11989: Add segment merged result cache for broker

yuanlihan edited a comment on pull request #11989:
URL: https://github.com/apache/druid/pull/11989#issuecomment-983815418


   Hi @paul-rogers, thanks for the comment.
   
   >what is that use case that will be improved by this change?
   
   This cache option can be useful for queries that need to scan both historical and realtime segments and that have a time range filter, such as today so far or this week/month/year so far.
   
   >The feature would seem to optimize queries that hit exactly the same data every time. How common is this? It might solve a use case that you have, say, 100 dashboards, each of which will issue the same query. One of them runs the real query, 99 fetch from the cache.
   
   Let's say there's a dashboard with a today-so-far time range filter. It issues the same query every minute to fetch realtime metrics. The cache feature would work as follows:
   
   1. Initially, there are 10 historical segments and 2 realtime segments that need to be scanned. This cache feature populates the merged result of the 10 historical segments into the cache. 
   2. One minute later, as you described, the next query fetches the merged result of the 10 historical segments from the cache and merges it with the result of a real query against the realtime segments. 
   3. A few moments later, let's assume that 2 new segments are served on Historicals, which means we now need to scan 12 historical segments and 2 realtime segments. The broker prunes the query and issues subqueries only for the 2 new historical segments and the 2 realtime segments. It then merges the previously cached result with the result of the 2 new historical segments and populates the merged result of all 12 historical segments into the cache. Finally, it returns the merged result of the historical and realtime segments.
   4. We can reuse the cached merged result as long as the segments it covers are a **_subset_** of the historical segments that need to be scanned.
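   The reuse rule in step 4 can be sketched roughly as below (the class and method names are hypothetical, not the actual code in this PR):

```java
import java.util.HashSet;
import java.util.Set;

// A rough sketch (hypothetical names) of the subset-reuse rule: a cached
// merged result is reusable whenever the segments it covers are a subset of
// the segments the query must scan, so only the difference needs a real scan.
public class MergedResultCacheSketch {

  /** Returns the segments that still need a real scan given a cached merged result. */
  public static Set<String> segmentsToScan(Set<String> required, Set<String> cached) {
    if (!required.containsAll(cached)) {
      // The cached result covers segments outside this query; it cannot be reused.
      return new HashSet<>(required);
    }
    Set<String> remaining = new HashSet<>(required);
    remaining.removeAll(cached); // only the newly-appeared segments are scanned
    return remaining;
  }

  public static void main(String[] args) {
    Set<String> cached = Set.of("seg1", "seg2");
    Set<String> required = Set.of("seg1", "seg2", "seg3");
    System.out.println(segmentsToScan(required, cached)); // prints [seg3]
  }
}
```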
   
   Is it common? Hmm, I think it would be better if we had an efficient cache option for this use case.
   
   >Here's a similar use case I've seen many times, but would not benefit from a post-merge cache: a dashboard wants to show the "last 10 minutes" of data, and we have minute-grain cubes (segments). Every query shifts the time range by one minute which will add a bit more new data, and exclude a bit of old data. With this design, we cache the post-merge data, so each shift in the time window will result in a new result set to cache.
   
   Minute-grain segments sound too fine-grained for Druid. In my experience, a query for the "last 10 minutes" of data should be fast anyway, since that data is usually still held by Peon workers. 
   Meanwhile, this new cache feature should also benefit fixed queries with a time filter, such as the last several hours or the last day/week. Let's assume that the query granularity is `PT5M` for a last-one-hour filter, `PT15M` for a last-several-hours filter, and so on. If the client is willing to align the start of the time interval filter to the query granularity, then every 5 minutes (or every 15 minutes) the fixed query effectively becomes a query with a fixed "timestamp so far" filter, which can potentially leverage the new cache feature.
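   As a rough sketch of the alignment idea (a hypothetical client-side helper, not part of the PR), flooring the filter start to the query granularity might look like:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the alignment idea: if the client floors the start
// of a rolling time filter to the query granularity (e.g. PT5M), every query
// issued within the same 5-minute bucket carries an identical interval, so
// the cached merged result can be shared between them.
public class IntervalAlignment {

  public static Instant alignToGranularity(Instant start, Duration granularity) {
    long bucketMillis = granularity.toMillis();
    long floored = (start.toEpochMilli() / bucketMillis) * bucketMillis;
    return Instant.ofEpochMilli(floored);
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2021-12-01T16:33:38Z");
    // Floored to the PT5M boundary: 2021-12-01T16:30:00Z
    System.out.println(alignToGranularity(now, Duration.ofMinutes(5)));
  }
}
```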
   
   >Can this feature instead cache pre-merge results? That way, we cache each minute slice once. The merge is still needed to gather up the slices required by the query. But, that merge is in-memory and should be pretty fast. The result would be, in this example, caching 1/10 the amount of data compared to caching post-merge data. And, less load on the historicals since we don't hit them each time the query time range shifts forward one minute.
   
   As far as I have observed, fetching many entries from the cache is not very efficient, especially when the entries are large. The pre-merge result caching you described sounds like the existing segment-level cache on the Broker. However, if the segment granularity is much larger than the query granularity, say 1-hour segment granularity with 1-minute query granularity, then segment-level caching will not help. So I see your point, but I'm not sure whether a query-granularity-level slice caching feature would be efficient. It's a nice point to discuss.
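   For discussion, here is a minimal sketch of what such query-granularity slice caching could look like (entirely hypothetical, with the per-bucket result reduced to a plain sum):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the per-slice alternative: one cache entry per
// query-granularity bucket, merged in memory at query time. The "result" per
// bucket is simplified to a partial sum; a real broker would query the
// missing buckets, merge them in, and write them back to the cache.
public class SliceCacheSketch {
  private final Map<Long, Long> sliceCache = new HashMap<>(); // bucketStart -> partial result

  public void put(long bucketStart, long partialResult) {
    sliceCache.put(bucketStart, partialResult);
  }

  /** Merges all cached buckets in [start, end); here the merge is a plain sum. */
  public long merge(long start, long end, long bucketMillis) {
    long total = 0;
    List<Long> missingBuckets = new ArrayList<>();
    for (long bucket = start; bucket < end; bucket += bucketMillis) {
      Long partial = sliceCache.get(bucket);
      if (partial == null) {
        missingBuckets.add(bucket); // these buckets would need a real scan
      } else {
        total += partial;
      }
    }
    // Note the cost pattern: one cache fetch per bucket, which is the
    // inefficiency mentioned above when the buckets are numerous or large.
    return total;
  }
}
```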
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: commits-help@druid.apache.org