Posted to dev@lucene.apache.org by "Steve Davids (JIRA)" <ji...@apache.org> on 2014/08/18 20:20:20 UTC

[jira] [Commented] (SOLR-5986) Don't allow runaway queries from harming Solr cluster health or search performance

    [ https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100991#comment-14100991 ] 

Steve Davids commented on SOLR-5986:
------------------------------------

We came across the issue again and added a lot more probes to get a grasp on exactly what is happening; I believe further tickets may be necessary to address the various pieces.

#1) We are setting the "timeout" request parameter, which tells the TimeLimitingCollector to throw a TimeExceededException. In our logs, however, we see the error message thrown after about an hour for one of the queries we tried, even though the timeout is set to a couple of minutes. This is presumably because query parsing takes about an hour; once the query is finally parsed and handed to the collector, the TimeLimitingCollector immediately throws an exception. We should have something throw the same exception during the query building phase, so that the partial-results warnings continue to just work (see the sketch below). It looks like the current work is more in the realm of solving this issue, which may fix the problems described in #2.
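To illustrate what I mean, here is a minimal sketch of a deadline check placed in the term-enumeration path rather than the collection path. The class name and the plain RuntimeException are placeholders of mine (as far as I can tell, TimeExceededException has a private constructor and can't be thrown from outside Lucene); this is not what the attached patch does:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.FilterAtomicReader.FilterTermsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Placeholder name: wraps any TermsEnum and aborts enumeration once a
// wall-clock deadline passes, so a runaway wildcard/proximity rewrite
// dies long before the collector is ever invoked.
public class TimeLimitedTermsEnum extends FilterTermsEnum {
  private final long deadlineNanos;

  public TimeLimitedTermsEnum(TermsEnum in, long timeoutMillis) {
    super(in);
    this.deadlineNanos = System.nanoTime() + timeoutMillis * 1000000L;
  }

  @Override
  public BytesRef next() throws IOException {
    if (System.nanoTime() - deadlineNanos > 0) {
      // Ideally this would be the same TimeExceededException that
      // TimeLimitingCollector throws, so the existing partial-results
      // handling keeps working unchanged.
      throw new RuntimeException("term enumeration exceeded allowed time");
    }
    return in.next();
  }
}
{code}

The point is only that the check has to live wherever the rewrite obtains its TermsEnum, not in the collector, since the collector never runs until rewriting finishes.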

#2) We set socket read timeouts on HttpClient, which causes the same query to be sent into the cluster multiple times, giving it a slow, painful death. This is even more problematic when using the SolrJ API: SolrJ's LBHttpSolrServer loops through *every* host in the cluster, and if a socket read timeout occurs it simply tries the next host in the list. Internally, every request made to the cluster from an outside SolrJ client will try to gather results from all shards, and once a socket read timeout happens inside the cluster, the same retry logic attempts to gather results from the next replica in the list. So, if we hypothetically had 10 shards with 3 replicas and made a request from an outside client, we would see 30 (external SolrJ calls, one per host, to request a distributed search) * 30 (each host called at least once for the internal distributed request) = 900 overall requests, i.e. each individual search host handles 30 requests. This should probably become its own ticket: either a) don't retry on a socket read timeout, or b) allow specifying a retry timeout of some sort in LBHttpSolrServer (which is what we did internally for simplicity's sake; see the sketch below).
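For b), a rough sketch of the shape of our internal workaround (DeadlineLBClient and the timeout values are hypothetical, not our actual code): one overall budget per logical request, so retries across replicas can't multiply the socket read timeout by the host count:

{code:java}
import java.net.MalformedURLException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Hypothetical wrapper: LBHttpSolrServer keeps its per-host retry
// behavior, but the caller's total wait is capped by a single budget.
public class DeadlineLBClient {
  private final LBHttpSolrServer lb;
  private final ExecutorService pool = Executors.newCachedThreadPool();

  public DeadlineLBClient(String... urls) throws MalformedURLException {
    lb = new LBHttpSolrServer(urls);
    lb.setConnectionTimeout(1000); // ms; assumed value, fail fast on connect
    lb.setSoTimeout(120000);       // ms; per-host socket read timeout
  }

  public QueryResponse query(final SolrQuery q, long overallTimeoutMs) throws Exception {
    Future<QueryResponse> f = pool.submit(new Callable<QueryResponse>() {
      @Override
      public QueryResponse call() throws Exception {
        return lb.query(q);
      }
    });
    try {
      // One budget for the whole request, no matter how many replicas
      // the load balancer walks through underneath.
      return f.get(overallTimeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      f.cancel(true); // best effort; the underlying HTTP request may linger
      throw e;
    }
  }
}
{code}

A real fix would also abort the in-flight HTTP request; Future.cancel() only interrupts the worker thread, which a blocking socket read can ignore.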

> Don't allow runaway queries from harming Solr cluster health or search performance
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-5986
>                 URL: https://issues.apache.org/jira/browse/SOLR-5986
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Steve Davids
>            Assignee: Anshum Gupta
>            Priority: Critical
>             Fix For: 4.10
>
>         Attachments: SOLR-5986.patch
>
>
> The intent of this ticket is to have all distributed search requests stop wasting CPU cycles on requests that have already timed out or are so complicated that they won't be able to execute. We came across a case where a nasty wildcard query within a proximity clause caused the cluster to enumerate terms for hours even though the query timeout was set to minutes. This caused a noticeable slowdown within the system and forced us to restart the replicas that happened to service that one request; in the worst case, users with a relatively low ZooKeeper timeout value will have nodes start dropping from the cluster due to long GC pauses.
> [~amccurry] built a mechanism into Apache Blur to help with this issue in BLUR-142 (see the commit comment for code, though look at the latest code on trunk for newer bug fixes).
> Solr should be able to either prevent these problematic queries from running via some heuristic (possibly the estimated size of heap usage) or interrupt all query threads once the time threshold is met. This issue mirrors what others have discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3C856ac15f0903272054q2dbdbd19kea3c5ba9e105b9d8@mail.gmail.com%3E
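For reference, a minimal sketch of the collection-phase guard Lucene already ships, TimeLimitingCollector. Note that it only fires per collected hit, so it cannot stop a query whose term enumeration runs away during rewrite, which is exactly the gap described above:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;

public class TimeLimitedSearch {
  public static void search(Directory dir, Query query, long timeoutMillis) throws IOException {
    DirectoryReader reader = DirectoryReader.open(dir);
    try {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopScoreDocCollector topDocs = TopScoreDocCollector.create(10, true);
      // The global counter advances in millisecond units (updated about
      // once per second), so ticksAllowed is effectively a ms budget.
      Collector limited = new TimeLimitingCollector(
          topDocs, TimeLimitingCollector.getGlobalCounter(), timeoutMillis);
      try {
        searcher.search(query, limited);
      } catch (TimeLimitingCollector.TimeExceededException e) {
        // Deadline hit during collection; whatever was gathered so far
        // is returned as partial results.
      }
      System.out.println("hits: " + topDocs.topDocs().totalHits);
    } finally {
      reader.close();
    }
  }
}
{code}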



