Posted to dev@lucene.apache.org by "Anshum Gupta (JIRA)" <ji...@apache.org> on 2014/09/05 22:31:29 UTC

[jira] [Updated] (SOLR-5986) Don't allow runaway queries from harming Solr cluster health or search performance

     [ https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anshum Gupta updated SOLR-5986:
-------------------------------
    Attachment: SOLR-5986.patch

New patch with the same approach but a few things fixed.

This uses timeAllowed (also used by the TimeLimitingCollector) to time out queries during rewrite. Here's the fine print, though: the rewrite timeout and the TimeLimitingCollector each manage their own counter for the clock, i.e. the maximum time that a query could run for is (2 * timeAllowed + Delta), where Delta is the time spent on anything other than query expansion and collection.
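
For reference, here is a minimal sketch of the collection-side half of that budget (illustrative only, not taken from the patch; it assumes a Lucene 4.x IndexSearcher named searcher, an already-rewritten query, and a timeAllowed of 1000ms):

import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.util.Counter;

// The collector starts its own clock at setBaseline(), so any time already
// spent in query rewrite/expansion is not charged against this budget; that
// is where the 2 * timeAllowed worst case comes from.
long timeAllowed = 1000; // ms, illustrative value

TopScoreDocCollector inner = TopScoreDocCollector.create(10, true);
Counter clock = TimeLimitingCollector.getGlobalCounter(); // ticked by the global timer thread
TimeLimitingCollector collector = new TimeLimitingCollector(inner, clock, timeAllowed);
collector.setBaseline(); // the collection-phase budget starts now

try {
  searcher.search(query, collector); // searcher and query assumed to exist
} catch (TimeLimitingCollector.TimeExceededException e) {
  // return partial results; the point is that the request is cut off
}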

I think this should be OK, especially considering the intention is to make sure that the request is killed and doesn't run forever.
There's always room for improvement, though: we might want to share the same counter/offset/timeout so the overall budget is more predictable.
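
Sharing a single clock/baseline might look roughly like this (purely hypothetical, not what the attached patch does; same Lucene imports as the sketch above, and inner, query and reader are assumed to already exist):

// One baseline for the whole request, so rewrite and collection draw from the
// same timeAllowed budget instead of each getting their own.
Counter clock = TimeLimitingCollector.getGlobalCounter();
long baseline = clock.get();
long timeAllowed = 1000; // ms, illustrative

// Rewrite phase: a periodic check like this (e.g. inside a wrapped TermsEnum)
// would abort term expansion once the shared budget is spent.
if (clock.get() - baseline > timeAllowed) {
  throw new RuntimeException("timeAllowed exceeded during query expansion"); // placeholder exception type
}
Query rewritten = query.rewrite(reader);

// Collection phase: reuse the same clock and baseline, so the collector only
// spends whatever is left of the single budget.
TimeLimitingCollector collector = new TimeLimitingCollector(inner, clock, timeAllowed);
collector.setBaseline(baseline);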

P.S.: I'm still running precommit to check whether I'm missing anything, and will then run the Solr + Lucene tests again.

> Don't allow runaway queries from harming Solr cluster health or search performance
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-5986
>                 URL: https://issues.apache.org/jira/browse/SOLR-5986
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Steve Davids
>            Assignee: Anshum Gupta
>            Priority: Critical
>             Fix For: 4.10
>
>         Attachments: SOLR-5986.patch, SOLR-5986.patch
>
>
> The intent of this ticket is to have all distributed search requests stop wasting CPU cycles on requests that have already timed out or are so complicated that they won't be able to execute. We came across a case where a nasty wildcard query within a proximity clause caused the cluster to enumerate terms for hours even though the query timeout was set to minutes. This caused a noticeable slowdown within the system and forced us to restart the replicas that happened to service that one request; in the worst case, users with a relatively low ZooKeeper timeout value will have nodes start dropping from the cluster due to long GC pauses.
> [~amccurry] built a mechanism into Apache Blur to help with this issue in BLUR-142 (see the commit comment for the code, though look at the latest code on trunk for newer bug fixes).
> Solr should be able to either prevent these problematic queries from running by some heuristic (possibly estimated size of heap usage) or be able to execute a thread interrupt on all query threads once the time threshold is met. This issue mirrors what others have discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3C856ac15f0903272054q2dbdbd19kea3c5ba9e105b9d8@mail.gmail.com%3E
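
As a rough illustration of the thread-interrupt option described above (generic Java, not Solr code; queryThread, searcher, query, collector and timeAllowed are placeholders):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// A watchdog interrupts the thread running the request once timeAllowed has
// elapsed; code that checks Thread.interrupted() or blocks on interruptible
// calls then stops early instead of enumerating terms for hours.
ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
final Thread queryThread = Thread.currentThread();
ScheduledFuture<?> kill = watchdog.schedule(new Runnable() {
  public void run() {
    queryThread.interrupt();
  }
}, timeAllowed, TimeUnit.MILLISECONDS);

try {
  searcher.search(query, collector);
} finally {
  kill.cancel(false); // do not interrupt a request that finished in time
}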



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org