You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@couchdb.apache.org by eiri <gi...@git.apache.org> on 2016/08/05 15:49:18 UTC

[GitHub] couchdb-couch pull request #190: Unblock waiting processes during idle proce...

GitHub user eiri opened a pull request:

    https://github.com/apache/couchdb-couch/pull/190

    Unblock waiting processes during idle process remove

    To clarify the description of the problem here is some terminology used in the process manager:
    
    - idle process - a running os process with no associated query server's client
    - assign process - set client (monitoring ref) to idle os process
    - teach ddoc - an initialization step that stores ddoc in couchjs cache
    - return process - mark process as an idle by setting its client (monitoring reference) to `undefined`
    - remove process - stop os process and decrease process counter
    - soft limit - number of os processes as per proc counter after which process manager starts to remove idle processes
    - hard limit - number of os processes as per proc counter after which process manager stops to spawn new processes. new requests placed in a waiting list
    - flush waiters - walk waiting list and start new process if process count below hard limit
    
    ### Problem
    
    Process manager flushes waiters only when it catches os process 'EXIT' or when process returned. Since it unlinks the processes before removing them, there are no waiters flush during soft limit trimming. That means once request got into waiting list its only chance to be flushed is for idle process count go over soft limit, then no new request to come after that and wait for some of the remaining client to return process. In sense waiters have the lowest priority in assigning, so we can have a situation when there are number of idle processes, but the waiters are still blocked.
    
    Commit that added `maybe_assign_proc/2` during return process tried to tackle it by directly assigning fresh idle process to one of the waiters, but since it doesn't ran process through teach ddoc this leads to crash with `query_protocol_error`.
    
    ### Solution
    
    I suggest to flash waiters on remove process instead of return process. That way waiters will always be getting first free process, will not be waiting for other client to return process and will not get superseded by new requests coming in-between.
    
    COUCHDB-3095

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloudant/couchdb-couch 66640-fix-proc_manager-limits

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/couchdb-couch/pull/190.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #190
    
----
commit 10e6d81d67771c25245d14bd517287c7e84c8eaa
Author: Eric Avdey <ei...@eiri.ca>
Date:   2016-08-05T04:05:19Z

    Revert "Assign waiter on return_proc in couch_proc_manager"
    
    This reverts commit 80f8330a8dd99c29f0bcc8d2d776ee04a54815b9.

commit 0b6ab0bd182fcbc0643b71fec229054871ca1b60
Author: Eric Avdey <ei...@eiri.ca>
Date:   2016-08-05T15:39:43Z

    Flush waiters during process remove

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by eiri <gi...@git.apache.org>.

Github user eiri commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    @davisp : I afraid I might not explained it well, so I'm going to try again, just to be sure we are on the same page.
    
    Basically, it's about fairness of the distribution of resources. Figuratively speaking we have two pools here: idle process pool controlled by soft limit and "processes that we can spawn" pool, controlled by hard limit. All incomings are competitors for the idle pool and if they loose they are becoming competitors for spawanable pool. If they loose there as well, they are becoming waiters.
    
    Now, if we'll be flushing waiters on proc remove, then that'd mean that we are still treating them as competitors for spawanable pool. But if we'll be assigning waiters in return proc, then we'd be treating them as competitors for the idle pool, i.e. new incomings will be getting lower chance to get resource, than waiters had when they've been incomings themselves. As a matter of fact, once hard limit hit we'll be changing allocation rule all together - all new incomings will be dropped into waiters list until it cleaned.
    
    So those two approaches: where to flush waiters, on proc return or on proc remove; will behave differently on bottlenecked system, i.e. when incoming rate exceeds proc return rate. After the hard limit hit, flushing waiters in return proc will be making a serving delay for each new incoming longer and longer, until we'll get burst of timeouts from overgrown waiting list. Flushing waiters in proc remove will increase chance for incoming to time out, but will keep serving "lucky" ddocs with the same delay. In sense this is a question of how we prefer to handle overflow: by degrading general performance or by increasing request dropping.
    
    I believe increase of dropping rate was intended approach of [the original design](https://github.com/apache/couchdb-couch/commit/db58e794f937a52b6b61c964942e56afa7d03d8b), it just waiters' flush was triggered in the wrong place, so the waiters haven't had fair chance on spawanable pool. But [adding  maybe_assign_proc](https://github.com/apache/couchdb-couch/commit/80f8330a8dd99c29f0bcc8d2d776ee04a54815b9) changed how things work, it just missed the teaching bit.
    
    In my opinion the original approach in terms of fairness was the correct one, so I'm trying to restore it here. I guess it can be improved if waiters will call `maybe_teaching_ddoc` on proc return (but without consequent assigning), so they'll have better chance to be served on a next try.
    
    Sorry for the lengthy comment, I spent fair amount of thinking on this, not just cowboying through, so I want us to be in agreement on which approach we prefer.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by eiri <gi...@git.apache.org>.

Github user eiri commented on the issue:

https://github.com/apache/couchdb-couch/pull/190

Thank you for the explanation, I definitely haven't thought about this as about rubber FIFO, but rather as some variation of [CoDel](https://en.wikipedia.org/wiki/CoDel#The_CoDel_algorithm). I guess the way we are doing `find_proc` is what confused me. I mean, for FIFO I'd expect to check if we at `hard_limit` before attempt to find idle process, not after. So idea about rearranging that beat sounds good to me.

About filtered replication issue: when we are [filtering through changes](https://github.com/apache/couchdb-couch/blob/master/src/couch_query_servers.erl#L390) we are requesting proc for each doc id and then [~immediately return it back](https://github.com/apache/couchdb-couch/blob/master/src/couch_query_servers.erl#L420)

Right now, if we are balancing on hard_limit edge, the filter has pretty good chance to get this proc back, because it'll ask for it right away and the only idle process guarantee to know about the filter, so it'll not try to spawn.

When the waiters will be stealing those procs, the filter will have to go into the tail of the waiters list, so there will be the response delay. Then, once it'll get out of waiters and request proc again, it might get one right away if meanwhile we went under `hard_limit`. So the filtered replications will be bumpy, i.e. processing changes feeds with seemingly random delays. The filtered replication is main concern because it's pretty much exclusive source of current `query_protocol_error` errors we are observing now, so empirically it's the most involved place.

For os proc growing RAM I'd vote for just periodic bumping `threshold_ts`, this is what ops doing now with reload(), except it's manual. Adding ddoc's de-teaching sounds complex and with requests been random and ddoc sized different, I'd expect the procs to show weird variations on how much memory each takes.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by eiri <gi...@git.apache.org>.

Github user eiri commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    Closing in favour of #191 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by eiri <gi...@git.apache.org>.

Github user eiri commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    Okey-dokey, I'll redo this with `maybe_teach_ddoc`.
    
    Should I add a feature to disable hard limit as well? Right now it's not possible, unless setting it to something with a lot of zeros. Or should it also go into separate PR to not to complicate things?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch pull request #190: Unblock waiting processes during idle proce...

Posted by davisp <gi...@git.apache.org>.

Github user davisp commented on a diff in the pull request:

    https://github.com/apache/couchdb-couch/pull/190#discussion_r74157712
  
    --- Diff: src/couch_proc_manager.erl ---
    @@ -436,9 +436,8 @@ assign_proc(#client{}=Client, #proc_int{client=undefined}=Proc) ->
         assign_proc(Pid, Proc).
     
     
    -return_proc(#state{} = State, #proc_int{} = ProcInt) ->
    -    #proc_int{pid = Pid, lang = Lang} = ProcInt,
    -    NewState = case is_process_alive(Pid) of true ->
    +return_proc(#state{} = State, #proc_int{pid = Pid} = ProcInt) ->
    +    case is_process_alive(Pid) of true ->
             case ProcInt#proc_int.t0 < State#state.threshold_ts of
    --- End diff --
    
    Specifically, the false clause on this case is where I think this is wrong. The threshold_ts is just for generational collection. If the proc is in the current generation then its just shoved back in the pool where it'd have to wait for an idle timeout before removing if we're over soft limit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by davisp <gi...@git.apache.org>.

Github user davisp commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    And flush_waiters would have to change and the handle_call for get_proc would need to move to a helper function and so on and such forth. But if it simplifies understanding and fixes the issue with that queue breaking I'd rather do that than just put another bandaid on the whole thing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by davisp <gi...@git.apache.org>.

Github user davisp commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    I'm pretty sure this is wrong. remove_proc is only called when a proc is exiting either due to an error or when it times out after idle. So your change still delays allocating waiters until a period of ?TIMEOUT inactivity.
    
    I'd go back to the previous and then just add a "maybe_teach_ddoc" function that checks if the ddoc key exists on the ProcInt record and if not uses teach_ddoc to fix that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by davisp <gi...@git.apache.org>.

Github user davisp commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    Also I was thinking a bit more complicated than just adding maybe_teach_ddoc. I was thinking that every client request just gets put into the waiters list and then we call flush_waiters in all the places that we need to flush.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by davisp <gi...@git.apache.org>.

Github user davisp commented on the issue:

https://github.com/apache/couchdb-couch/pull/190

Thanks for the explanation of your thoughts there as that shows we're definitely thinking about this differently. For reference, here's my description of what I think should be going on:

There is a simple FIFO queue of clients requesting processes from a single pool. The size of the pool can grow to $hard_limit in size before we make clients wait. When load on the pool decreases the pool will shrink until it reaches $soft_limit. The $soft_limit parameter is a configuration knob for how much of a spike in load we can withstand before moving to the relatively slower "spawn new process for client" mode. The rate at which the pool shrinks from $hard_limit to $soft_limit is based on the process idle timeout when it becomes a candidate for removal. When a client requests a process for use with a ddoc we have an optimization to return one that already knows about the ddoc to avoid paying the penalty for sending the ddoc to a process that doesn't know about it.

The bug as I understand it is that our FIFO queue of clients is broken as sometimes a new client can jump the queue when there are waiters already present. This leads to issues where we have timeouts in clients because of steady state load scenarios that prevent waiters from ever being processed.

I'm not sure on your comment about how stealing processes would make filter_view bumpy. I do agree that in busy servers we would see design doc accumulation in os processes though. But there was never any concern about that because the assumption is that ddoc memory usage is trivial in the grand scheme of things. However, I have seen clusters with 2-3K couchjs processes holding a significant amount of RAM. To combat that I would look at improving teach_ddoc to start removing unused ddoc keys with a new query server command (or alternatively take the nuclear option and just replace the processes every so often like bump_threshold_ts every 5m).

Thinking about it a bit more what might make more sense is to rearrange things to make those queues a bit more obvious. Instead of attempting to assign a process to a client and only moving to the waiters list on failure perhaps we just take all clients and add to the waiters list and then call flush_waiters every time we handle a message that affects waiters or the pool membership?

Though I would like to hear more what you were thinking about the stealing processes bit. I've certainly never considered dropping waiters in favor of new clients here. On first blush I'd say that doesn't sound right but perhaps I'm not seeing something you reasoned out.

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by davisp <gi...@git.apache.org>.

Github user davisp commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    I think setting hard limit to a huge number is close enough to disabling it given that there are maximum numbers of os processes and the like.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch pull request #190: Unblock waiting processes during idle proce...

Posted by eiri <gi...@git.apache.org>.

Github user eiri commented on a diff in the pull request:

    https://github.com/apache/couchdb-couch/pull/190#discussion_r74180469
  
    --- Diff: src/couch_proc_manager.erl ---
    @@ -436,9 +436,8 @@ assign_proc(#client{}=Client, #proc_int{client=undefined}=Proc) ->
         assign_proc(Pid, Proc).
     
     
    -return_proc(#state{} = State, #proc_int{} = ProcInt) ->
    -    #proc_int{pid = Pid, lang = Lang} = ProcInt,
    -    NewState = case is_process_alive(Pid) of true ->
    +return_proc(#state{} = State, #proc_int{pid = Pid} = ProcInt) ->
    +    case is_process_alive(Pid) of true ->
             case ProcInt#proc_int.t0 < State#state.threshold_ts of
    --- End diff --
    
    But this is pretty much the same as it is right now, we are always [flushing](https://github.com/apache/couchdb-couch/blob/master/src/couch_proc_manager.erl#L455) in return_proc, but in `false` clause it's meaningless, because state's counts hasn't been updated.
    
    Out of curiosity, how this generational collection controlled? I understand what it works, but is there anything that uses it or is it just for manual `couch_proc_manager:reload()`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by eiri <gi...@git.apache.org>.

Github user eiri commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    @davisp Can you please take a look if this looks like a right solution for you? I know it solves the issue with "query_protocol_error", doesn't lead to waiters timeouts and doesn't generally break process manager, but just want to be sure I'm not missing something obvious here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by eiri <gi...@git.apache.org>.

Github user eiri commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    Well, my reasoning was that the waiters got into wait because hard limit was reached, i.e. we haven't found idle process that already know about requestor's ddoc, we haven't found idle process we can teach about ddoc and we don't have resources to spawn new os process. But when we are resolving waiters in `return_proc` it's not because resources got free, but because we found a way to share the allocated resources even more, i.e. we are blocking for one reason, but resolving for another.
    
    As I imagine it, on busy clusters this pattern will make each os process eventually accumulate all the existing ddocs and then requests to steal processes from each other more and more frequently, especially in [filter_view](https://github.com/cloudant/couchdb-couch/blob/master/src/couch_query_servers.erl#L417), which would make filtered replications rather bumpy.
    
    But if you think those are futile concerns, then sure, I'll close this PR and do another one with maybe_teach_ddoc in maybe_assign_proc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch issue #190: Unblock waiting processes during idle process remo...

Posted by davisp <gi...@git.apache.org>.

Github user davisp commented on the issue:

    https://github.com/apache/couchdb-couch/pull/190
  
    Cool. Sounds like were in agreement. As for CoDel, I'm never nearly that fancy on purpose. \U0001f603
    
    That makes more sense on the filtered replication. However my answer to that would be two approaches. First, if you're having issues, increase your hard limit configuration (or disable it completely if you have enough RAM to absorb load spikes). Second, fix the replicator so its not being stupid and requesting an OS process per doc and instead passes through a batch and/or shares the os process (or group of processes) between replication workers.
    
    I agree that bumping the threshold ts would be easiest as well as it also doesn't require any sort of query server protocol change. Seems like adding a max age parameter to the couch_proc_manager config would be fairly simple to add to the lifetime check on return_proc. Though that might be a different ticket than reworking the waiters queue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] couchdb-couch pull request #190: Unblock waiting processes during idle proce...

Posted by eiri <gi...@git.apache.org>.

Github user eiri closed the pull request at:

    https://github.com/apache/couchdb-couch/pull/190


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---