Posted to dev@openwhisk.apache.org by Tyson Norris <tn...@adobe.com> on 2017/05/01 15:54:41 UTC

concurrent requests on actions

Hi All -
I created this issue some time ago to discuss concurrent requests on actions: [1]. Some people suggested discussing it on the mailing list, so I wanted to start that discussion.

I’ve been doing some testing against this branch with Markus’s work on the new container pool: [2]
I believe there are a few open PRs in upstream related to this work, but this seemed like a reasonable place to test against a variety of the reactive invoker and pool changes - I’d be interested to hear if anyone disagrees.

Recently I ran some tests
- with “throughput.sh” in [3] using concurrency of 10 (it will also be interesting to test with the --rps option in loadtest...)
- using a change that checks actions for an annotation “max-concurrent” (in case there is some reason actions want to enforce current behavior of strict serial invocation per container?)
- when scheduling an action against the pool, if there is a currently “busy” container with this action, AND the annotation is present for this action, AND concurrent requests < max-concurrent, then this container is used to invoke the action
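
To make the scheduling rule above concrete, here is a minimal sketch in Python. It is not the code from the branch: `Container`, `pick_container` and `max_concurrent` are hypothetical names, and preferring an idle warm container before a busy one is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    action: str      # name of the action this container was initialized with
    active: int = 0  # number of in-flight activations (0 means idle/warm)


def max_concurrent(annotations: dict) -> int:
    """Return the concurrency limit; a missing annotation keeps strict serial behavior."""
    return int(annotations.get("max-concurrent", 1))


def pick_container(pool: list, action: str, annotations: dict) -> Optional[Container]:
    """Pick a container for `action`, or None if a new one would have to be started."""
    limit = max_concurrent(annotations)
    # Assumption: an idle warm container for this action is always preferred.
    for c in pool:
        if c.action == action and c.active == 0:
            return c
    # Reuse a busy container only when the annotation is present AND the
    # container is still below its max-concurrent limit.
    if "max-concurrent" in annotations:
        for c in pool:
            if c.action == action and 0 < c.active < limit:
                return c
    return None
```

With a pool holding one busy "noop" container (3 in-flight requests), `pick_container(pool, "noop", {"max-concurrent": 10})` returns that container, while the same call without the annotation returns None and falls back to today's one-request-per-container behavior.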

Below is a summary (approx 10x throughput with concurrent requests) and I would like to get some feedback on:
- what are the cases for having actions that require container isolation per request? node is a good example that should NOT need this, but maybe there are cases where it is more important, e.g. if there are cases where stateful actions are used?
- log collection approach: I have not attempted to resolve log collection issues. I would expect that revising the log sentinel marker to include the activation ID would help. Logs stored with the activation would include interleaved activations in some cases (which should be expected with concurrent request processing?), and processing logs after an activation completes would require some different logic (e.g. logs emitted at the start of an activation may have already been collected as part of another activation’s log collection).
- advice on creating a PR to discuss this in more detail - should I wait for more of the container pooling changes to get to master? Or submit a PR to Markus’s new-containerpool branch?
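
As a rough illustration of the log collection idea above, the following Python sketch assumes a sentinel line that carries the activation ID. The marker text `END_OF_ACTIVATION`, the function name `collect_logs`, and the `start_index` bookkeeping are all hypothetical; nothing like this exists in the branch yet. Lines are not consumed when collected, so overlapping activations can share interleaved lines.

```python
SENTINEL_PREFIX = "END_OF_ACTIVATION "  # hypothetical marker, followed by the activation ID


def collect_logs(stream, start_index):
    """Map each finished activation ID to the log lines emitted while it ran.

    stream: container log lines in emission order.
    start_index: activation ID -> index into `stream` where that activation began.
    """
    collected = {}
    for i, line in enumerate(stream):
        if not line.startswith(SENTINEL_PREFIX):
            continue
        activation_id = line[len(SENTINEL_PREFIX):].strip()
        begin = start_index[activation_id]
        # Keep every non-sentinel line in the activation's window. Lines stay
        # in the stream, so a concurrent activation may collect them as well.
        collected[activation_id] = [
            entry for entry in stream[begin:i]
            if not entry.startswith(SENTINEL_PREFIX)
        ]
    return collected
```

For two overlapping activations A and B, both result lists contain the lines emitted while the other one was running, which is the interleaving described above.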

Thanks
Tyson

Summary of loadtest report with max-concurrent ENABLED (I used 10000, but this limit wasn’t reached):
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Target URL:          https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughputConcurrent?blocking=true
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Max requests:        10000
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Concurrency level:   10
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Agent:               keepalive
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Completed requests:  10000
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total errors:        0
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total time:          241.900480915 s
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Requests per second: 41
[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Mean latency:        241.7 ms

Summary of loadtest report with max-concurrent DISABLED:
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Target URL:          https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughput?blocking=true
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Max requests:        10000
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Concurrency level:   10
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Agent:               keepalive
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Completed requests:  10000
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total errors:        19
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total time:          2770.658048791 s
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Requests per second: 4
[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Mean latency:        2767.3 ms
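
The two summaries can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers reported above; the function names are illustrative. It computes requests per second from completed requests and total time, and compares that with the estimate concurrency / mean latency (Little's law), which should roughly agree for a closed-loop test with a fixed concurrency of 10.

```python
def requests_per_second(completed: int, total_seconds: float) -> float:
    """Throughput as completed requests divided by wall-clock time."""
    return completed / total_seconds


def littles_law_rps(concurrency: int, mean_latency_ms: float) -> float:
    """Throughput estimate from in-flight requests and mean latency."""
    return concurrency / (mean_latency_ms / 1000.0)


# Numbers copied from the two loadtest summaries above.
enabled = requests_per_second(10000, 241.900480915)
disabled = requests_per_second(10000, 2770.658048791)
speedup = enabled / disabled

print(f"enabled:  {enabled:.1f} rps (Little's law: {littles_law_rps(10, 241.7):.1f})")
print(f"disabled: {disabled:.1f} rps (Little's law: {littles_law_rps(10, 2767.3):.1f})")
print(f"speedup:  {speedup:.1f}x")
```

Both estimates line up with the reported 41 and 4 requests per second, and the ratio comes out somewhat above 11x, consistent with the "approx 10x" figure.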





[1] https://github.com/openwhisk/openwhisk/issues/2026
[2] https://github.com/markusthoemmes/openwhisk/tree/new-containerpool
[3] https://github.com/markusthoemmes/openwhisk-performance

Re: concurrent requests on actions

Posted by Dascalita Dragos <dd...@gmail.com>.
Why not consider giving developers options to control the level of
concurrency, instead of deciding on their behalf? I think that cases such
as the ones Tyson is mentioning make sense; unless we build something that
will estimate the resources needed by an action automatically, letting the
developer specify it instead might be a means of "supervised learning" that
the system can use further in order to make decisions at runtime.

Dragos

Re: concurrent requests on actions

Posted by Tyson Norris <tn...@adobe.com>.
Sure, many of our use cases are mostly stitching together API calls, as opposed to being CPU bound - consider a simple javascript action that wraps a downstream http API (or many APIs).

What do you mean by “more efficient packing of I/O-bound processes”? For example, in the case of actions that wrap an API call, typically the action developer is NOT the owner of the API call, so it’s not clear how to handle this more efficiently than by creating a nodejs action that proxies (multiple concurrent) network requests around, but does little actual computing besides possibly some minor request/response parsing etc. In our cases we are much more likely to run into bottlenecks with concurrent users without any concurrent container usage support, unless we greatly over-provision clusters, which would drastically reduce efficiency. It is much simpler to provision for anticipated or immediate load changes when each new container can support multiple concurrent requests, instead of each new container supporting a single request.

More tests demonstrating these cases (e.g. API wrappers, and compute-centric actions) will help this discussion; I’ll work on providing those.
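
To illustrate the API-wrapper case, here is a small self-contained Python sketch. It is an assumption-laden stand-in, not an OpenWhisk action: `wrapped_api_call` simulates a downstream HTTP request with `asyncio.sleep`, and the other function names are made up for the example. It compares handling 10 requests one at a time with handling them concurrently in a single process.

```python
import asyncio
import time


async def wrapped_api_call(delay_s: float) -> str:
    # Stand-in for a downstream HTTP request: the action only waits on I/O.
    await asyncio.sleep(delay_s)
    return "ok"


async def handle_serially(n: int, delay_s: float) -> float:
    """One request at a time, like strict serial invocation per container."""
    start = time.perf_counter()
    for _ in range(n):
        await wrapped_api_call(delay_s)
    return time.perf_counter() - start


async def handle_concurrently(n: int, delay_s: float) -> float:
    """All requests in flight at once within the same process."""
    start = time.perf_counter()
    await asyncio.gather(*(wrapped_api_call(delay_s) for _ in range(n)))
    return time.perf_counter() - start


def compare(n: int = 10, delay_s: float = 0.05):
    serial = asyncio.run(handle_serially(n, delay_s))
    concurrent = asyncio.run(handle_concurrently(n, delay_s))
    return serial, concurrent
```

Calling `compare()` should show the serial run taking roughly n times the simulated delay while the concurrent run takes about one delay, since the work is almost entirely waiting.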

Thanks
Tyson 



Re: concurrent requests on actions

Posted by Nick Mitchell <mo...@gmail.com>.
won't this only be of benefit for invocations that are mostly sleepy? e.g.
I/O-bound? because if an action uses CPU flat-out, then there is no
throughput win to be had (by increasing the parallelism of CPU-bound
processes), given the small CPU sliver that each container gets -- unless
there is a concomitant increase in concurrency, i.e. CPU slice?

if so, then my gut tells me that there are more general solutions to this
(i.e. more efficient packing of I/O-bound processes)
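
A simplified model of the point above, sketched in Python under stated assumptions: each request needs `cpu_ms` of CPU time plus `io_ms` of waiting, and the container has a single fixed CPU slice. The function name and the model itself are illustrative only; real containers will behave less cleanly.

```python
def max_throughput_rps(concurrency: int, cpu_ms: float, io_ms: float) -> float:
    """Upper bound on requests per second for one container.

    Throughput is limited both by how many requests can be in flight
    (concurrency / latency) and by the single CPU slice (1 / cpu time).
    """
    latency_bound = concurrency / ((cpu_ms + io_ms) / 1000.0)
    cpu_bound = float("inf") if cpu_ms == 0 else 1000.0 / cpu_ms
    return min(latency_bound, cpu_bound)


# Mostly sleepy action: 5 ms CPU, 195 ms waiting on I/O.
print(max_throughput_rps(1, 5, 195), max_throughput_rps(10, 5, 195))

# CPU flat-out action: 200 ms CPU, no waiting.
print(max_throughput_rps(1, 200, 0), max_throughput_rps(10, 200, 0))
```

In this model the I/O-bound action gains tenfold from a concurrency of 10, while the CPU-bound action stays at the same throughput, because the CPU slice is already the limit.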

On Mon, May 1, 2017 at 5:36 PM, Tyson Norris <tn...@adobe.com> wrote:

> Thanks Markus.
>
> Can you direct me to the travis job where I can see the 40+RPS? I agree
> that is a big gap and would like to take a look - I didn’t see anything in
> https://travis-ci.org/openwhisk/openwhisk/builds/226918375 ; maybe I’m
> looking in the wrong place.
>
> I will work on putting together a PR to discuss.
>
> Thanks
> Tyson
>
>
> On May 1, 2017, at 2:22 PM, Markus Thömmes <markusthoemmes@me.com> wrote:
>
> Hi Tyson,
>
> Sounds like you did a lot of investigation here, thanks a lot for that :)
>
> Seeing the numbers, 4 RPS in the "off" case seems very odd. The Travis
> build that runs the current system as is also reaches 40+ RPS. So we'd need
> to look at a mismatch here.
>
> Other than that I'd indeed suspect a great improvement in throughput from
> your work!
>
> Implementation-wise I don't have a strong opinion, but it might be worth
> discussing the details first and landing your impl. once all my staging is done
> (the open PRs). That'd ease git operations. If you want to discuss your
> impl. now I suggest you send a PR to my new-containerpool branch and share
> the diff here for discussion.
>
> Cheers,
> Markus
>
> Sent from my iPhone
>
> On 01.05.2017 at 23:16, Tyson Norris <tnorris@adobe.com> wrote:
>
> Hi Michael -
> Concurrent requests would only reuse a running/warm container for
> same-action requests. So if the action has bad/rogue behavior, it will
> limit its own throughput only, not the throughput of other actions.
>
> This is ignoring the current implementation of the activation feed, which
> I guess is susceptible to a flood of slow running activations. If those
> activations are for the same action, running concurrently should be enough
> to not starve the system for other activations (with faster actions) to be
> processed. In case they are all different actions, OR not allowed to
> execute concurrently, then in the name of quality-of-service, it may also
> be desirable to reserve some resources (i.e. separate activation feeds) for
> known-to-be-faster actions, so that fast-running actions are not penalized
> for existing alongside the slow-running actions. This would require a more
> complicated throughput test to demonstrate.
>
> Thanks
> Tyson
>
> On May 1, 2017, at 1:13 PM, Michael Marth <mmarth@adobe.com> wrote:
>
> Hi Tyson,
>
> 10x more throughput, i.e. being able to run OW at 1/10 of the cost -
> definitely worth looking into :)
>
> Like Rodric mentioned before, I figured some features might become more
> complex to implement, like billing, log collection, etc. But given such a
> huge advancement in throughput, that would be worth it IMHO.
> One thing I wonder about, though, is resilience against rogue actions. If
> an action is blocking (in the Node-sense, not the OW-sense), would that not
> block Node’s event loop and thus block other actions in that container? One
> could argue, though, that this rogue action would only block other
> executions of itself, not affect other actions or customers. WDYT?
>
> Michael
>
>
>
>
> On 01/05/17 17:54, "Tyson Norris" <tnorris@adobe.com> wrote:
>
> Hi All -
> I created this issue some time ago to discuss concurrent requests on
> actions: [1] Some people mentioned discussing on the mailing list so I
> wanted to start that discussion.
>
> I’ve been doing some testing against this branch with Markus’s work on the
> new container pool: [2]
> I believe there are a few open PRs in upstream related to this work, but
> this seemed like a reasonable place to test against a variety of the
> reactive invoker and pool changes - I’d be interested to hear if anyone
> disagrees.
>
> Recently I ran some tests
> - with “throughput.sh” in [3] using concurrency of 10 (it will also be
> interesting to test with the --rps option in loadtest...)
> - using a change that checks actions for an annotation “max-concurrent”
> (in case there is some reason actions want to enforce current behavior of
> strict serial invocation per container?)
> - when scheduling an actions against the pool, if there is a currently
> “busy” container with this action, AND the annotation is present for this
> action, AND concurrent requests < max-concurrent, the this container is
> used to invoke the action
>
> Below is a summary (approx 10x throughput with concurrent requests) and I
> would like to get some feedback on:
> - what are the cases for having actions that require container isolation
> per request? node is a good example that should NOT need this, but maybe
> there are cases where it is more important, e.g. if there are cases where
> stateful actions are used?
> - log collection approach: I have not attempted to resolve log collection
> issues; I would expect that revising the log sentinel marker to include the
> activation ID would help, and logs stored with the activation would include
> interleaved activations in some cases (which should be expected with
> concurrent request processing?), and require some different logic to
> process logs after an activation completes (e.g. logs emitted at the start
> of an activation may have already been collected as part of another
> activation’s log collection, etc).
> - advice on creating a PR to discuss this in more detail - should I wait
> for more of the container pooling changes to get to master? Or submit a PR
> to Markus’s new-containerpool branch?
>
> Thanks
> Tyson
>
> Summary of loadtest report with max-concurrent ENABLED (I used 10000, but
> this limit wasn’t reached):
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Target URL:
> https://na01.safelinks.protection.outlook.com/?url=
> https%3A%2F%2F192.168.99.100%2Fapi%2Fv1%2Fnamespaces%2F_%2Factions%
> 2FnoopThroughputConcurrent%3Fblocking%3Dtrue&data=02%7C01%7C%
> 7C796dfc317cde44c9e83908d490ce7faa%7Cfa7b1b5a7b34438794aed2c178de
> cee1%7C0%7C0%7C636292663971484169&sdata=uv9kYh5uBoIDXDlEivgMClJ6TDGDmz
> TdKOgZPZjkBko%3D&reserved=0
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Max requests:        10000
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Concurrency level:   10
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Agent:
>  keepalive
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Completed requests:  10000
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total errors:        0
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total time:
> 241.900480915 s
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Requests per second: 41
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Mean latency:        241.7
> ms
>
> Summary of loadtest report with max-concurrent DISABLED:
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Target URL:
> https://na01.safelinks.protection.outlook.com/?url=
> https%3A%2F%2F192.168.99.100%2Fapi%2Fv1%2Fnamespaces%2F_%
> 2Factions%2FnoopThroughput%3Fblocking%3Dtrue&data=02%7C01%7C%
> 7C796dfc317cde44c9e83908d490ce7faa%7Cfa7b1b5a7b34438794aed2c178de
> cee1%7C0%7C0%7C636292663971494178&sdata=h6sMS0s2WQXFMcLg8sSAq%2F56p%
> 2F%2BmVmth%2B%2FsqTOVmeAc%3D&reserved=0
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Max requests:        10000
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Concurrency level:   10
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Agent:
>  keepalive
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Completed requests:  10000
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total errors:        19
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total time:
> 2770.658048791 s
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Requests per second: 4
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Mean latency:        2767.3
> ms
>
>
>
>
>
> [1] https://na01.safelinks.protection.outlook.com/?url=
> https%3A%2F%2Fgithub.com%2Fopenwhisk%2Fopenwhisk%
> 2Fissues%2F2026&data=02%7C01%7C%7C796dfc317cde44c9e83908d490ce7faa%
> 7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C636292663971494178&sdata=eg%
> 2FsSPRQYapQHPNbfMLCW%2B%2F1yAqn8zSo0nJ5yQjmkns%3D&reserved=0
> [2] https://na01.safelinks.protection.outlook.com/?url=
> https%3A%2F%2Fgithub.com%2Fmarkusthoemmes%2Fopenwhisk%
> 2Ftree%2Fnew-containerpool&data=02%7C01%7C%7C796dfc317cde44c9e83908d490ce
> 7faa%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%
> 7C636292663971494178&sdata=IZcN9szW71SdL%2ByssJm9k3EgzaU4b5idI5yFWyR7%
> 2BL4%3D&reserved=0
> [3] https://na01.safelinks.protection.outlook.com/?url=
> https%3A%2F%2Fgithub.com%2Fmarkusthoemmes%2Fopenwhisk-
> performance&data=02%7C01%7C%7C796dfc317cde44c9e83908d490ce7faa%
> 7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C636292663971494178&sdata=
> WkOlhTsplKQm6mUkZtwWLXzCrQg%2FUmKtqOErIw6gFAA%3D&reserved=0
>
>
>

Re: concurrent requests on actions

Posted by Tyson Norris <tn...@adobe.com>.
Thanks Markus.

Can you direct me to the Travis job where I can see the 40+ RPS? I agree that is a big gap and would like to take a look - I didn’t see anything in https://travis-ci.org/openwhisk/openwhisk/builds/226918375 ; maybe I’m looking in the wrong place.

I will work on putting together a PR to discuss.

Thanks
Tyson


On May 1, 2017, at 2:22 PM, Markus Thömmes <ma...@me.com> wrote:

Hi Tyson,

Sounds like you did a lot of investigation here, thanks a lot for that :)

Seeing the numbers, 4 RPS in the "off" case seems very odd. The Travis build that runs the current system as-is also reaches 40+ RPS, so we'd need to look into that mismatch.

Other than that I'd indeed suspect a great improvement in throughput from your work!

Implementation-wise I don't have a strong opinion, but it might be worth discussing the details first and landing your implementation once all my staging is done (the open PRs). That would ease the git operations. If you want to discuss your implementation now, I suggest you send a PR to my new-containerpool branch and share the diff here for discussion.

Cheers,
Markus

Sent from my iPhone




Re: concurrent requests on actions

Posted by Markus Thömmes <ma...@me.com>.
Hi Tyson,

Sounds like you did a lot of investigation here, thanks a lot for that :)

Seeing the numbers, 4 RPS in the "off" case seems very odd. The Travis build that runs the current system as-is also reaches 40+ RPS, so we'd need to look into that mismatch.

Other than that I'd indeed suspect a great improvement in throughput from your work!

Implementation-wise I don't have a strong opinion, but it might be worth discussing the details first and landing your implementation once all my staging is done (the open PRs). That would ease the git operations. If you want to discuss your implementation now, I suggest you send a PR to my new-containerpool branch and share the diff here for discussion.

Cheers,
Markus

Sent from my iPhone

> On 01.05.2017 at 23:16, Tyson Norris <tn...@adobe.com> wrote:
> 
> Hi Michael -
> Concurrent requests would only reuse a running/warm container for same-action requests. So if the action has bad/rogue behavior, it will limit its own throughput only, not the throughput of other actions.
> 
> This is ignoring the current implementation of the activation feed, which I guess is susceptible to a flood of slow-running activations. If those activations are for the same action, running them concurrently should be enough to keep the system from starving other activations (of faster actions). If they are all different actions, OR are not allowed to execute concurrently, then in the name of quality of service it may also be desirable to reserve some resources (i.e. separate activation feeds) for known-to-be-faster actions, so that fast-running actions are not penalized for existing alongside the slow-running actions. This would require a more complicated throughput test to demonstrate.
> 
> Thanks
> Tyson

Re: concurrent requests on actions

Posted by Tyson Norris <tn...@adobe.com>.
Hi Michael -
Concurrent requests would only reuse a running/warm container for same-action requests. So if the action has bad/rogue behavior, it will limit its own throughput only, not the throughput of other actions.

This is ignoring the current implementation of the activation feed, which I guess is susceptible to a flood of slow-running activations. If those activations are for the same action, running them concurrently should be enough to keep the system from starving other activations (of faster actions). If they are all different actions, OR are not allowed to execute concurrently, then in the name of quality of service it may also be desirable to reserve some resources (i.e. separate activation feeds) for known-to-be-faster actions, so that fast-running actions are not penalized for existing alongside the slow-running actions. This would require a more complicated throughput test to demonstrate.
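
To make the starvation concern concrete, here is a rough sketch of the idea, not a model of the real activation feed. It assumes all activations are already queued, that a "feed" is a FIFO served by a fixed number of slots, and that run times are known; the durations, slot counts and the finish_times helper are made up for illustration.

```python
import heapq

def finish_times(durations, slots):
    """Run jobs in FIFO order on `slots` workers; return each job's finish time."""
    free_at = [0.0] * slots              # min-heap of times at which a slot frees up
    heapq.heapify(free_at)
    finished = []
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        heapq.heappush(free_at, start + d)
        finished.append(start + d)
    return finished

SLOW, FAST = 10.0, 0.1                   # hypothetical run times in seconds
slow_jobs = [SLOW] * 8                   # a flood of slow activations arrives first
fast_jobs = [FAST] * 4

# One shared feed with 4 slots: fast jobs queue behind the slow flood.
shared = finish_times(slow_jobs + fast_jobs, slots=4)
shared_fast_done = max(shared[len(slow_jobs):])

# Same 4 slots, but 1 of them is reserved for a separate "fast" feed.
reserved_fast_done = max(finish_times(fast_jobs, slots=1))
reserved_slow_done = max(finish_times(slow_jobs, slots=3))

print(f"shared feed:   last fast activation done at {shared_fast_done:.1f}s")
print(f"reserved feed: last fast activation done at {reserved_fast_done:.1f}s")
print(f"reserved feed: last slow activation done at {reserved_slow_done:.1f}s")
```

With these made-up numbers the fast activations finish at about 20.1 s on the shared feed versus 0.4 s with a reserved slot, while the slow ones slip from 20 s to 30 s, which is the trade-off a real throughput test would need to measure.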

Thanks
Tyson







On May 1, 2017, at 1:13 PM, Michael Marth <mm...@adobe.com> wrote:

Hi Tyson,

10x more throughput, i.e. being able to run OW at 1/10 of the cost - definitely worth looking into :)

As Rodric mentioned before, I figured some features might become more complex to implement, like billing, log collection, etc. But given such a huge gain in throughput, that would be worth it IMHO.
One thing I wonder about, though, is resilience against rogue actions. If an action is blocking (in the Node-sense, not the OW-sense), would that not block Node’s event loop and thus block other actions in that container? One could argue, though, that this rogue action would only block other executions of itself, not affect other actions or customers. WDYT?

Michael






Re: concurrent requests on actions

Posted by Michael Marth <mm...@adobe.com>.
Hi Tyson,

10x more throughput, i.e. being able to run OW at 1/10 of the cost - definitely worth looking into :)

As Rodric mentioned before, I figured some features might become more complex to implement, like billing, log collection, etc. But given such a huge gain in throughput, that would be worth it IMHO.
One thing I wonder about, though, is resilience against rogue actions. If an action is blocking (in the Node sense, not the OW sense), would that not block Node’s event loop and thus block other actions in that container? One could argue, though, that this rogue action would only block other executions of itself, not affect other actions or customers. WDYT?
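
To illustrate the kind of blocking meant here, a small sketch using Python's asyncio as a stand-in for Node's event loop (an analogy only, not OpenWhisk or Node code). The two action functions and the 50 ms delay are hypothetical: one blocks the loop with a synchronous sleep, the other yields to it.

```python
import asyncio
import time

async def blocking_action(delay):
    time.sleep(delay)            # blocks the whole event loop, like sync work in Node

async def cooperative_action(delay):
    await asyncio.sleep(delay)   # yields to the loop, so other activations can proceed

async def run_concurrently(action, n, delay):
    """Start n activations at once on one event loop; return elapsed seconds."""
    started = time.perf_counter()
    await asyncio.gather(*(action(delay) for _ in range(n)))
    return time.perf_counter() - started

blocking_elapsed = asyncio.run(run_concurrently(blocking_action, 5, 0.05))
cooperative_elapsed = asyncio.run(run_concurrently(cooperative_action, 5, 0.05))
print(f"blocking:    {blocking_elapsed:.2f}s")
print(f"cooperative: {cooperative_elapsed:.2f}s")
```

The five blocking activations run one after another (roughly 5 x 50 ms), while the cooperative ones overlap. In a container that only ever serves one action, that slowdown would stay within that action's own activations.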

Michael




On 01/05/17 17:54, "Tyson Norris" <tn...@adobe.com> wrote:

>Hi All -
>I created this issue some time ago to discuss concurrent requests on actions: [1] Some people mentioned discussing on the mailing list so I wanted to start that discussion.
>
>I’ve been doing some testing against this branch with Markus’s work on the new container pool: [2]
>I believe there are a few open PRs in upstream related to this work, but this seemed like a reasonable place to test against a variety of the reactive invoker and pool changes - I’d be interested to hear if anyone disagrees.
>
>Recently I ran some tests
>- with “throughput.sh” in [3] using concurrency of 10 (it will also be interesting to test with the --rps option in loadtest...)
>- using a change that checks actions for an annotation “max-concurrent” (in case there is some reason actions want to enforce current behavior of strict serial invocation per container?)
>- when scheduling an action against the pool, if there is a currently “busy” container with this action, AND the annotation is present for this action, AND concurrent requests < max-concurrent, then this container is used to invoke the action
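
A minimal sketch of the reuse rule in the last bullet, to make the three conditions explicit. This is not the container pool code from the branch: the Container class, pick_container and the in_flight counter are hypothetical names, and only the “max-concurrent” annotation comes from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ANNOTATION = "max-concurrent"   # annotation name taken from the proposal

@dataclass
class Container:                # hypothetical stand-in for a pooled container
    action: str
    in_flight: int = 0          # activations currently running in it

def pick_container(pool: List[Container], action: str,
                   annotations: Dict[str, int]) -> Optional[Container]:
    """Return a container to reuse for `action`, or None if a new one is needed."""
    limit = annotations.get(ANNOTATION)
    for c in pool:
        if c.action != action:
            continue                      # never share across different actions
        if c.in_flight == 0:
            return c                      # idle warm container: serial behavior
        if limit is not None and c.in_flight < limit:
            return c                      # busy, but still below max-concurrent
    return None

pool = [Container("noop", in_flight=3), Container("other", in_flight=0)]
print(pick_container(pool, "noop", {"max-concurrent": 10}))  # reuses the busy container
print(pick_container(pool, "noop", {}))                      # None: no annotation, stay serial
```

Without the annotation a busy container is never handed out, so actions that need strict serial invocation per container keep that behavior.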
>
>Below is a summary (approx 10x throughput with concurrent requests) and I would like to get some feedback on:
>- what are the cases for having actions that require container isolation per request? node is a good example that should NOT need this, but maybe there are cases where it is more important, e.g. if there are cases where stateful actions are used?
>- log collection approach: I have not attempted to resolve log collection issues; I would expect that revising the log sentinel marker to include the activation ID would help, and logs stored with the activation would include interleaved activations in some cases (which should be expected with concurrent request processing?), and require some different logic to process logs after an activation completes (e.g. logs emitted at the start of an activation may have already been collected as part of another activation’s log collection, etc).
>- advice on creating a PR to discuss this in more detail - should I wait for more of the container pooling changes to get to master? Or submit a PR to Markus’s new-containerpool branch?
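
A rough sketch of the log collection idea in the second bullet, under stated assumptions: the sentinel text below is a made-up marker (not the real one), the sentinel line carries the activation ID, and the collector knows when each activation was dispatched. Each activation then gets every line seen between its dispatch and its own sentinel, including lines from overlapping activations.

```python
SENTINEL = "XXX_END_OF_ACTIVATION_XXX"   # hypothetical marker, followed by an activation ID

def collect_logs(timeline):
    """timeline items are ("start", id) when an activation is dispatched,
    or ("log", line) for each line read from the container's output.
    Returns {activation_id: [lines seen while that activation was running]}."""
    buffer = []          # every non-sentinel line seen so far
    started_at = {}      # activation id -> index into buffer at dispatch time
    collected = {}
    for kind, value in timeline:
        if kind == "start":
            started_at[value] = len(buffer)
        elif value.startswith(SENTINEL):
            activation_id = value[len(SENTINEL):].strip()
            # Everything logged between dispatch and this activation's sentinel,
            # including lines that belong to overlapping activations.
            collected[activation_id] = buffer[started_at.pop(activation_id):]
        else:
            buffer.append(value)
    return collected

timeline = [
    ("start", "a1"), ("log", "a1: begin"),
    ("start", "a2"), ("log", "a2: begin"),
    ("log", "a1: done"), ("log", f"{SENTINEL} a1"),
    ("log", "a2: done"), ("log", f"{SENTINEL} a2"),
]
logs = collect_logs(timeline)
print(logs["a1"])
print(logs["a2"])
```

Here both activations end up with interleaved lines from the other one, and the buffer has to be kept around after a1 completes because a2 still needs lines that were already handed to a1.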
>
>Thanks
>Tyson
>
>Summary of loadtest report with max-concurrent ENABLED (I used 10000, but this limit wasn’t reached):
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Target URL:          https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughputConcurrent?blocking=true
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Max requests:        10000
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Concurrency level:   10
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Agent:               keepalive
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Completed requests:  10000
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total errors:        0
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total time:          241.900480915 s
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Requests per second: 41
>[Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Mean latency:        241.7 ms
>
>Summary of loadtest report with max-concurrent DISABLED:
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Target URL:          https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughput?blocking=true
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Max requests:        10000
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Concurrency level:   10
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Agent:               keepalive
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Completed requests:  10000
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total errors:        19
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total time:          2770.658048791 s
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Requests per second: 4
>[Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Mean latency:        2767.3 ms
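As a sanity check on the "approx 10x" figure, the throughput ratio can be recomputed from the two reports. The numbers below are copied from the loadtest output above; the helper name is just for illustration.

```python
def throughput(completed_requests: int, total_seconds: float) -> float:
    """Requests per second over the whole run."""
    return completed_requests / total_seconds


# Values taken from the two loadtest summaries.
enabled_rps = throughput(10000, 241.900480915)
disabled_rps = throughput(10000, 2770.658048791)
speedup = enabled_rps / disabled_rps

# Cross-check: with a fixed concurrency level, throughput should be close to
# concurrency / mean latency (Little's law).
enabled_rps_from_latency = 10 / 0.2417
disabled_rps_from_latency = 10 / 2.7673

print(f"enabled:  {enabled_rps:.1f} req/s")
print(f"disabled: {disabled_rps:.1f} req/s")
print(f"speedup:  {speedup:.1f}x")
```

This works out to roughly 41 versus 3.6 requests per second, a bit over 11x, and the latency-based estimates agree with both reports.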
>
>[1] https://github.com/openwhisk/openwhisk/issues/2026
>[2] https://github.com/markusthoemmes/openwhisk/tree/new-containerpool
>[3] https://github.com/markusthoemmes/openwhisk-performance