Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/17 21:21:33 UTC

[GitHub] [druid] kroeders opened a new issue #10294: Failed Query due to missing lookup on some servers

kroeders opened a new issue #10294:
URL: https://github.com/apache/druid/issues/10294


   ### Motivation
   
   Given a query with lookups, one server with a missing lookup can cause query execution to fail. When the broker distributes a query to historicals and realtime servers, if any one of those servers does not have the lookup, the query fails as a whole. Lookups can fail to load for a number of reasons, such as missing firewall rules or drivers, or slow loading of large, frequently updated lookups. These queries could be served if the broker considered lookup status when selecting servers for querying.
   
   To reproduce this issue, load the druid-lookups-cached-global extension and create a database-backed lookup. Launch an additional historical without the database driver, and the lookup will fail to load on that historical. Queries using the lookup will then fail altogether because of the one historical without it.
   
   ### Proposed Changes
   
   The proposal is to modify the broker to track the lookup status on historical and realtime servers and avoid routing queries to servers where relevant lookups are not loaded. This can be done by making server selection aware of the query and excluding servers without required lookups. 
   
   #### Tracking Lookup Status in Broker
   
   The coordinator is responsible for tracking lookups and ensuring they are updated on query servers, so it has the information on which version has been successfully loaded on each node. This is available through the nodeStatus API. The broker can periodically poll the coordinator’s nodeStatus API and maintain a local cache of lookup status on each query server. 
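
A minimal Java sketch of the broker-side cache described above. The class name `LookupStatusCache`, its methods, and the `Supplier` that stands in for the call to the coordinator's nodeStatus API are all hypothetical; none of them are existing Druid classes, and the real response format is not modeled here.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/**
 * Hypothetical broker-side cache of which lookups are loaded on each query server.
 * The Supplier is an assumption standing in for a poll of the coordinator.
 */
class LookupStatusCache {
    private final Supplier<Map<String, Set<String>>> coordinatorPoll;
    // Replaced wholesale on each refresh so readers never see a half-updated view.
    private volatile Map<String, Set<String>> loadedByServer = Map.of();

    LookupStatusCache(Supplier<Map<String, Set<String>>> coordinatorPoll) {
        this.coordinatorPoll = coordinatorPoll;
    }

    /** Intended to be called periodically, for example from a scheduled executor. */
    void refresh() {
        try {
            loadedByServer = Map.copyOf(coordinatorPoll.get());
        } catch (RuntimeException e) {
            // Keep the previous snapshot if the poll fails.
        }
    }

    /** True if the server has reported every required lookup as loaded. */
    boolean hasLookups(String server, Set<String> required) {
        return loadedByServer.getOrDefault(server, Set.of()).containsAll(required);
    }
}
```

In this sketch a server the cache has not yet seen is treated as having no lookups, so queries that need lookups would avoid it until the first successful poll. Whether that is the right default is a design choice.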
   
   Alternatively, the broker could poll the internal listener API on the query servers, but this repeats work that the coordinator already does. Other transport mechanisms such as ZooKeeper could also be used, or the coordinator could push the information to the brokers.
   
   #### Avoiding Query Servers without Lookup
   
   CachingClusteredClient is responsible for determining which servers fulfil a query. It retrieves the set of segment-to-server mappings relevant to the query and then uses a strategy to select a server for each segment. Server selection is currently not aware of the query. When filtering segments in TierSelectorStrategy before applying the ServerSelector strategy, the query could be considered to avoid query servers without required lookups. Default methods can be added to avoid breaking existing implementations.
   
   Alternatively, the pick methods on the ServerSelector interfaces could be extended with a Query parameter so that servers without relevant lookups are avoided. Because this is an exceptional case, the servers could also be filtered in CachingClusteredClient before selection. Another alternative would involve handling the exception from the historical/realtime server and retrying the query for those segments.
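
The default-method idea can be sketched in Java as follows. `SelectorSketch` and `QuerySketch` are simplified, hypothetical stand-ins rather than Druid's actual interfaces, and servers are represented as plain strings. The point is only that a query-aware overload with a default body leaves existing implementations compiling and behaving as before.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a query; only exposes what this sketch needs. */
interface QuerySketch {
    List<String> requiredLookups();
}

/** Simplified, hypothetical stand-in for a server selection strategy. */
interface SelectorSketch {
    /** Existing, query-unaware entry point. */
    Optional<String> pick(List<String> servers);

    /** New overload; the default ignores the query, so old implementations are unaffected. */
    default Optional<String> pick(List<String> servers, QuerySketch query) {
        return pick(servers);
    }
}
```

An implementation written before the overload existed, such as `servers -> servers.stream().findFirst()`, can still be called through the two-argument form and returns the same result. A lookup-aware implementation would override the overload instead.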
   
   #### Extracting Lookups from Queries
   
   Lookups specified as functions in SQL appear in the query either as virtual columns with a lookup expression or as the right-hand source of a join. A new query runner could be added to extract the lookups, compare them with the lookups loaded on each server, and store the resulting blocklist of servers in the query context.
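
A rough Java sketch of the extract-and-compare step, under stated assumptions. `LookupBlocklist` and its methods are hypothetical names. The regular expression only recognizes simple `lookup(expr, 'name')` calls in an expression string and does not handle nested calls in the first argument or join sources; a real implementation would more likely walk the parsed expression and the query's data source.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookupBlocklist {
    // Matches lookup(<simple expr>, 'name'); deliberately simplistic.
    private static final Pattern LOOKUP_CALL =
        Pattern.compile("(?i)\\blookup\\s*\\(\\s*[^,()]+,\\s*'([^']+)'\\s*\\)");

    /** Collects the lookup names referenced by a virtual column expression. */
    static Set<String> lookupsIn(String expression) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = LOOKUP_CALL.matcher(expression);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Returns the servers that are missing at least one required lookup. */
    static Set<String> blockedServers(Set<String> required,
                                      Map<String, Set<String>> loadedByServer) {
        Set<String> blocked = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : loadedByServer.entrySet()) {
            if (!e.getValue().containsAll(required)) {
                blocked.add(e.getKey());
            }
        }
        return blocked;
    }
}
```

The resulting set is what would be stored in the query context as the blocklist, so that later selection steps only need a membership check per server.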
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kroeders commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
kroeders commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-698913447


   hi @cesure ! if the lookup is loaded on at least one historical, then the query should succeed. 




[GitHub] [druid] cesure commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
cesure commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-698802063


   Hi! I have a question regarding this issue: Will this also fix the problem we have when a Historical Node has already started up and the Broker sends queries to it but the Lookups are still loading which leads to failing queries?
   
   We use Lookups a lot and they are quite huge, so it takes some minutes to load them into memory and have them available for queries. During this time you find these log messages:
   
   `2020-09-25T08:20:56,385 INFO [NamespaceExtractionCacheManager-1] org.apache.druid.server.lookup.namespace.JdbcCacheGenerator - Finished loading 23083 values for namespace [JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{...}] : org.apache.druid.server.lookup.namespace.cache.CacheScheduler$EntryImpl@2e360bcc`
   
   Thanks!




[GitHub] [druid] jihoonson commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-737436492


   Hey @kroeders, sorry it took long to get back to you. The idea LGTM overall, I left some comments on the PR.




[GitHub] [druid] kroeders commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
kroeders commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-698508429


   Sure! 
   One alternative would be to filter the servers in CachingClusteredClient to remove invalid servers before selecting servers for the query. This is a little bit complicated because the ServerSelector provided from the Timeline is shared between queries, so the set passed to groupSegmentsByServer has to be copied and filtered. There could be something like a ServerFilter that gets applied. However, changes are needed to ServerSelector to allow it to be recreated without certain servers. From what I can tell, segments and related servers are all maintained across queries, so something would have to change around that point to get this functionality.
   
   It might be useful to make a general purpose filter that excludes blocked servers for a given query. This could allow more general filtering based on server health, for example.
   
   I think adding Query to the pick calls is a pretty unobtrusive way to get this and maybe other filtering as well. 
   
   It looks like this all happens in the same thread, so I guess the Query could be put into ThreadLocal storage, but then there are questions about when to clean up that data.




[GitHub] [druid] dylwylie commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
dylwylie commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-700139960


   It feels like it'd be much simpler to have the queryable nodes themselves take care of this.
   
   While starting up, a historical or realtime task shouldn't advertise itself as available for queries until all lookups are loaded.
   
   What do you think @kroeders ?
   
   




[GitHub] [druid] kroeders commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
kroeders commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-700229405


   Hi @dylwylie thanks for taking a look - Our original use case was focused on queryable nodes with failed lookups. The production cluster is around a few hundred queryable nodes and occasionally a lookup will fail to load on a few of them. This can happen for a variety of reasons, like database authentication issues for that particular host. Users may have a variety of queries which may or may not use a variety of lookups, so I'm reluctant to remove a queryable node that could serve other queries. If all lookups have to load before a queryable node announces itself, it also introduces a dangerous point of failure where adding a faulty lookup (or an old database backed lookup failing) could bring down the whole cluster. New lookups and regularly updated lookups introduce some complications too, because the node has already announced itself. 
   
   The reason I prefer the server selection approach is that it only impacts queries with lookups and only does so in a positive way. All queries that already succeed will still succeed, and some queries that currently fail will succeed. The cost is a few comparisons during the normal server selection process, incurred only if the server selector strategy is used. The only change to core Druid itself to make that work is to make the selector aware of the query so it can route around nodes that aren't suitable for that query.
   
   Do you think it would be worth the risk / lost capacity to other queries to filter at the data node level? Is there anything that seems particularly overcomplicated that you see in this approach?




[GitHub] [druid] a2l007 commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
a2l007 commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-698393448


   From an overall review of the PR #10428 , the design looks reasonable to me, because it minimizes the changes in core druid and separates out the lookup filtering changes into an extension. Once a queryableDruidServer for the first segment specific to a query is picked, the subsequent pick calls for that query might not need the query object for making the pick decision, so it would be good to have some kind of caching in the extension. 
   It looks useful to me, but I'll wait for other reviewers to take a look at this as well. Could you please add additional information on this issue regarding what were the alternate implementations considered in solving this problem?
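
One way the caching suggested here might look, as a hedged Java sketch. `PerQueryLookupCache`, the use of a query ID string as the key, and the `Function` that stands in for lookup extraction are all assumptions for illustration and are not part of the PR.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical per-query cache so lookup extraction runs once per query, not once per pick. */
class PerQueryLookupCache {
    private final Map<String, Set<String>> byQueryId = new ConcurrentHashMap<>();
    // Assumed extraction step: maps a query ID to the lookups that query needs.
    private final Function<String, Set<String>> extractor;

    PerQueryLookupCache(Function<String, Set<String>> extractor) {
        this.extractor = extractor;
    }

    /** Computes the required lookups on the first call for a query and reuses them afterwards. */
    Set<String> requiredLookups(String queryId) {
        return byQueryId.computeIfAbsent(queryId, extractor);
    }

    /** Must be called when the query finishes, otherwise entries accumulate. */
    void release(String queryId) {
        byQueryId.remove(queryId);
    }
}
```

The sketch leaves open where `release` would be called from, which is the same cleanup question that any per-query state in the selection path raises.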




[GitHub] [druid] a2l007 commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
a2l007 commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-700090133


   @jihoonson , @gianm  Do you have any comments on this issue and the related PR, #10428?
   I've tagged this under Design Review as the PR proposes to modify the core ServerSelectorStrategy interface.




[GitHub] [druid] kroeders commented on issue #10294: Failed Query due to missing lookup on some servers

Posted by GitBox <gi...@apache.org>.
kroeders commented on issue #10294:
URL: https://github.com/apache/druid/issues/10294#issuecomment-697986177


   Hi @a2l007 @jihoonson I was wondering if you could have a look at the approach in the linked pull requests when you have a chance? The basic approach is to add an extension that provides a new ServerSelectorStrategy that filters out servers without related lookups. There would be one small change to the core code, which would be to add the Query as an optional parameter through the server selection codepath. The advantage is that a new ServerSelectorStrategy can be registered that applies a filter to the servers coming from TierSelectorStrategy and then delegates to an existing ServerSelectorStrategy, so there is minimal overhead and potentially the same approach could be extended to filter for other reasons. I don't think there is much overhead to this change and I'm going to test it on our staging cluster in the next few days. Thanks!
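
The filter-then-delegate shape described above can be sketched in Java like this. `LookupFilteringPicker` is a hypothetical name, servers are plain strings, and the delegate is modeled as a `Function` rather than Druid's real ServerSelectorStrategy, so this shows the structure of the approach and not the code in the pull request.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical wrapper: drop servers missing required lookups, then defer to an existing strategy. */
class LookupFilteringPicker {
    private final Map<String, Set<String>> loadedByServer;
    private final Function<List<String>, Optional<String>> delegate;

    LookupFilteringPicker(Map<String, Set<String>> loadedByServer,
                          Function<List<String>, Optional<String>> delegate) {
        this.loadedByServer = loadedByServer;
        this.delegate = delegate;
    }

    Optional<String> pick(List<String> candidates, Set<String> requiredLookups) {
        // Keep only servers that have every lookup the query needs.
        List<String> eligible = candidates.stream()
            .filter(s -> loadedByServer.getOrDefault(s, Set.of()).containsAll(requiredLookups))
            .collect(Collectors.toList());
        // The wrapped strategy makes the actual choice among the remaining servers.
        return delegate.apply(eligible);
    }
}
```

When a query needs no lookups, the filter passes every candidate through, so the delegate sees the same input it would have seen without the wrapper.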

