Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2015/06/03 02:13:49 UTC

[jira] [Comment Edited] (SOLR-7571) Return metrics with update requests to allow clients to self-throttle

    [ https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570015#comment-14570015 ] 

Erick Erickson edited comment on SOLR-7571 at 6/3/15 12:13 AM:
---------------------------------------------------------------

Otis:

Good suggestion, thanks! In the scenario I was thinking about, the metrics would come back in the result from each update, forcing the client to check whether it should self-throttle, which would kind of leave updating by posting files out in the cold.

One tricky bit here: say the "metric" was the number of threads ConcurrentUpdateSolrClient had outstanding (on each leader). You'd have to return either the entire list each time (possibly 100s of leaders) or the max threads outstanding or.... Not quite sure how providing the info through JMX would deal with that, any ideas?
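
To make the "max threads outstanding" option concrete, here is a minimal sketch of collapsing a per-leader list into one number. The class name, method name, and the map of leader name to thread count are all hypothetical; nothing here is an existing Solr or SolrJ API.

```java
import java.util.Map;

/** Hypothetical helper, not part of Solr: reduces per-leader counts to one value. */
final class LeaderMetrics {
    private LeaderMetrics() {}

    /**
     * Returns the largest outstanding-thread count reported by any leader,
     * or 0 if no leader reported anything.
     */
    static int maxOutstandingThreads(Map<String, Integer> threadsByLeader) {
        int max = 0;
        for (int count : threadsByLeader.values()) {
            max = Math.max(max, count);
        }
        return max;
    }
}
```

Returning only the max keeps the response small even with 100s of leaders, at the cost of not telling the client which leader is the busy one.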

What about failing if whatever metric is chosen is exceeded? Let's say it's the number of outstanding threads CUSC has open. Then, instead of trying to open up yet _another_ thread, fail with a "SLOW_DOWN_PLEASE" exception; the client would then, say, wait a little while and re-submit the request or some such. I'm not at all sure how that would play in a situation where you had 5 shards and only _one_ of them hit this condition though.
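
A rough sketch of what the client side of the "wait a little while and re-submit" idea could look like. SlowDownPleaseException and BackoffRetry are made-up names for illustration, and the doubling delay is just one assumed policy; none of this exists in Solr or SolrJ.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for a "SLOW_DOWN_PLEASE" failure; not a Solr class. */
final class SlowDownPleaseException extends Exception {
    SlowDownPleaseException(String message) {
        super(message);
    }
}

/** Hypothetical client-side helper that re-submits an update after a pause. */
final class BackoffRetry {
    private BackoffRetry() {}

    /**
     * Runs the update; on SlowDownPleaseException it sleeps and tries again,
     * doubling the delay each time, and rethrows once maxAttempts is used up.
     * Any other exception is passed straight through.
     */
    static <T> T submitWithBackoff(Callable<T> update, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return update.call();
            } catch (SlowDownPleaseException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and let the caller decide
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // assumed policy: simple exponential backoff
            }
        }
    }
}
```

This does not answer the multi-shard question: if only one of 5 shards rejects a batch, the caller still has to decide whether to re-send the whole batch or only the rejected part.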


> Return metrics with update requests to allow clients to self-throttle
> ---------------------------------------------------------------------
>
>                 Key: SOLR-7571
>                 URL: https://issues.apache.org/jira/browse/SOLR-7571
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.10.3
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>
> I've assigned this to myself to keep track of it; anyone who wants it, please feel free to take it.
> I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client (and post.jar for JSON files for that matter) firehose updates (150 separate threads in total) at Solr. Eventually, replicas (not leaders) go into recovery, the state cascades, and the entire cluster becomes unusable. SOLR-5850 delays the behavior, but it still occurs. There are no errors in the follower's logs; this is leader-initiated recovery because of a timeout.
> I think the root problem is that the client is just sending too many requests to the cluster, overwhelming ConcurrentUpdateSolrClient/Server (used by the leader to distribute update requests to all the followers); this was observed in Solr 4.10.3+. I see thread counts of 500+ when this happens.
> So assuming that this is the root cause, the obvious "cure" is "don't index that fast". This is unsatisfactory since "that fast" is variable; the only recourse is to set that threshold low enough that the Solr cluster isn't being driven as fast as it can be.
> We should provide some mechanism for having the client throttle itself. The number of outstanding update threads is one possibility. The client could then slow down sending updates to Solr. 
> I'm not sure there's a good way to deal with this on the server. Once the timeout is encountered, you don't know whether the doc has actually been indexed on the follower (actually, in this case it _is_ indexed, it just takes a while). Ideally we'd just manage it all magically, but an alternative that lets clients dynamically throttle themselves seems do-able.
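
As one possible reading of "the client could then slow down", here is a small sketch that turns a reported outstanding-thread count into a pause before the next update. The class, the softLimit threshold, and the linear scaling are all assumptions for illustration; the issue does not specify any of them.

```java
/** Hypothetical client-side throttle; not part of Solr or SolrJ. */
final class SelfThrottle {
    private SelfThrottle() {}

    /**
     * Returns how long (in ms) the client should pause before its next update.
     * At or below softLimit there is no pause; above it the pause grows
     * linearly with the overshoot and is capped at maxPauseMs, which is
     * reached once the count is twice softLimit.
     */
    static long pauseMillis(int outstandingThreads, int softLimit, long maxPauseMs) {
        if (softLimit <= 0) {
            throw new IllegalArgumentException("softLimit must be positive");
        }
        if (outstandingThreads <= softLimit) {
            return 0L;
        }
        double overshoot = (outstandingThreads - softLimit) / (double) softLimit;
        return Math.min(maxPauseMs, Math.round(overshoot * maxPauseMs));
    }
}
```

With an assumed softLimit of 200 and maxPauseMs of 1000, a reported count of 300 gives a 500 ms pause, and the 500+ counts mentioned above would hit the 1000 ms cap.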


