Posted to dev@openwhisk.apache.org by Dominic Kim <st...@gmail.com> on 2019/04/04 08:37:19 UTC

New architecture proposal

Hi.

I have proposed a new architecture.
https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal

It includes many controversial agenda items and breaking changes,
so I would like to form a general consensus on them.

I'd really appreciate it if you could share any feedback on them.

In the meantime, I will open a first PR.

Thanks
Regards
Dominic.

Re: New architecture proposal

Posted by Dominic Kim <st...@gmail.com>.
> I think it's definitely worth trying.
Agree. This will give us better options than just a simple timeout-based
clean-up.
I feel like you have some detailed ideas on GC, including prediction-based
decisions, and they may go beyond what we can achieve with just SPIs.
Also, though I can add some SPIs for further enhancement, I am not sure they
would align with what you are thinking.
So how about opening a PR on your side, instead of me preparing the SPIs and
building blocks for them myself?

> If a container "survived", then GC will prolong its TTL with another
> 10mins, and so forth.

I think we need to reach a consensus on your ideas.
It would be great if you wrote down a proposal, so that we share the same
view and can get more people involved in the discussion.
If the GC actor only checks the status every 10 mins, there could be many
edge cases.
For example, if some actions run every 3~5 mins, their containers would
never be removed.
If we reduce the GC interval to a small value such as 10s, I feel it would
have a similar effect to what I suggested in my proposal.
Since I only got a part of your idea, I may be misunderstanding your
intention.
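
To make that edge case concrete, here is a minimal sketch (the names and the
TTL check are hypothetical, only meant to illustrate the interval problem):

import scala.concurrent.duration._

case class ContainerInfo(id: String, lastUsed: Deadline)

// A sweep that removes containers idle for longer than `ttl`. If an action
// is invoked every 3~5 mins and ttl is 10 mins, `lastUsed` is refreshed
// before every deadline, so its container survives every sweep and is
// never removed.
def sweep(pool: List[ContainerInfo], ttl: FiniteDuration): List[ContainerInfo] =
  pool.filter(c => c.lastUsed + ttl > Deadline.now)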

Regarding cluster resources, yes, I totally agree; that's exactly how the
cloud works.
We also do not keep thousands of VMs running all the time.
In my proposal, we can also make some activations wait for containers to
become ready under such heavy loads.
I don't think it would take longer than the time to scale out the cluster
in the case of bursts, because the timeout is generally a few seconds in my
proposal.
(It does not even have a paused state.)

I just mentioned it to say that's a natural tradeoff, rather than one that
resides only in my proposal.
On-demand scale-out is what we are doing, and I think this is a basic idea
behind the cloud.
But if we have a strong SLA with some non-functional requirements, on-demand
scale-out may not be enough, as it would delay the execution anyway.
We need some minimum cluster scale to meet such requirements.


Best regards
Dominic

On Sat, Apr 13, 2019 at 6:34 AM Dascalita Dragos <dd...@gmail.com> wrote:

> "...I think GC implementation is orthogonal to my proposal.
> So if you can deal with such problems, I would gladly apply it as well..."
> I think it's definitely worth trying. The "mark-and-sweep" pattern with an
> eventually consistent state may give GC a good premise to make decisions.
> If GC is instantiated as a singleton actor, through Akka, it can achieve a
> highly fault-tolerant service, allowing for self-healing with no single
> point of failure.
>
> If an action changes state every "2ms" as in the example above, and the GC
> gets that change after "100ms", this could work fine; a timeout of "10mins"
> may give GC a decent time window to make decisions. The "mark-and-sweep"
> then allows containers ( ContainerProxy ) to move that container into the
> "survivor" space - to use G1 GC terminology - in case the container
> processed an activation in between the "mark" and "sweep" commands. If a
> container "survived", then GC will prolong its TTL with another 10mins, and
> so forth.  At the same time GC doesn't need to know about *all* state
> changes (busy->warm->etc), especially if they happen so frequently. That
> generates unnecessary noise in the network. GC could use a sampling
> strategy instead, or check for "last activation" *only* when the allocated
> memory in the cluster is over a threshold; there are a number of strategies
> that could ensure GC runs with "minimum enough info" and "as infrequently
> as needed" to make the right decisions.
>
> "...If we "must" guarantee the execution of users, we need to have a bigger
> cluster than the sum of users' limit....
>
> I agree. For the cases when an operator can't run such a large cluster to
> accommodate all namespaces regardless of their traffic, that's when a smart
> GC becomes critical to achieve cost efficiency. At least in the deployments
> I'm involved in, the cluster aims to shut-down most of its VMs when there's
> no traffic, and scale back up as quickly as possible as traffic increases.
> We run clusters in public clouds; we can't afford to keep thousands of VMs
> running if nobody uses them. When the cluster scales up, there's a time window
> when there could be a congestion of resources; so, as with network
> congestion, activations should suffer for a little while, but at least they
> all get a fair chance to execute. They execute slower, but at least they
> execute; but if containers are left to destroy themselves, then it's hard
> to predict cluster behavior when congested, and it's hard to guarantee an
> SLA. This is where I was coming from.
>
> I'm sure we'll find a way to accommodate multiple deployment patterns. I'm
> describing my setup with the hope that the new architecture will allow a
> configuration mechanism or SPIs for such key areas.
>
> On Thu, Apr 11, 2019 at 8:15 PM Dominic Kim <st...@gmail.com> wrote:
>
> > Well, that is not a tradeoff that resides only in my proposal.
> > If we "must" guarantee the execution of users, we need to have a bigger
> > cluster than the sum of users' limits.
> > Even with the current implementation, if the sum of concurrency limits
> > exceeds the system resources, we cannot completely guarantee all
> > executions under a burst.
> > (We cannot guarantee 200 concurrent executions with 100 containers.)
> >
> > The reason why I used a kind of self-GC is that it's not easy to track
> > the resource status in real time.
> > It is also the reason why I took the pull-based model.
> >
> > For example, if we controlled container deletion in a central way, we
> > would have to track the status of all containers, such as how many
> > containers are running for each action, which of them are idle, where
> > they are running, and so on.
> > But the status of resources changes blazingly fast.
> > Below is one example. The execution is over within 2 ms.
> > {
> >     "namespace": "style95",
> >     "name": "hello-world",
> >     "version": "0.0.1",
> >     "subject": "style95",
> >     "activationId": "ba2cc561fc8e4272acc561fc8ea27210",
> >     "start": 1554967214351,
> >     "end": 1554967214353,
> >     "duration": 2,
> >     "response": {
> >         "status": "success",
> >         "statusCode": 0,
> > .
> > .
> > .
> > }
> >
> > It means the container status (busy, warm) can change every 2 ms.
> >
> > Also, if you are not thinking of one central component which decides to
> > delete containers (it could be a SPOF), there will be multiple components
> > making the decision.
> > When one of them makes a decision, it should consider the decisions made
> > (and to be made) by the others at the same time.
> > So we would need to track the status of all containers, consider all the
> > decisions made by multiple components, and finally make the optimal
> > decision to delete containers, all within 2 ms.
> > I think this is not viable.
> >
> > Your idea sounds great; it could be a great enhancement.
> > And I think the GC implementation is orthogonal to my proposal.
> > So if you can deal with such problems, I would gladly apply it as well.
> >
> >
> > Best regards
> > Dominic
> >
> >
> >
> >
> >
> >
> On Thu, Apr 11, 2019 at 3:15 PM Dascalita Dragos <dd...@gmail.com> wrote:
> >
> > > Thanks Dominic for the details.
> > >
> > > It seems like an operator has to choose between “do I hurt
> > > performance (low timeout) or do I hurt the SLA”?
> > >
> > > If this is the trade-off, isn’t this a hard choice to make? So I’m
> > > wondering whether some alternative designs could be used for this
> > > problem.
> > >
> > > The key decision here is: should OW be given cluster-wide power to view
> > > and control the resources or not? IIUC the current proposal doesn’t
> > > support this? I’m not saying the proposed model is not good; I’d just
> > > feel more comfortable if OW allowed more options instead of one, in the
> > > same way the JVM allows multiple GC implementations. In the proposed
> > > model the GC would offload the decision to each container, while other
> > > implementations may do it differently. For instance, I’d implement
> > > something dynamic that adapts the timeout to the load, and maybe try
> > > some predictive ML algorithms to manage resources - if a model suggests
> > > that out of 3 actions that could be removed, 1 has a higher probability
> > > of being invoked again, wouldn’t it be more efficient to remove one of
> > > the other 2? Such logic can only be achieved through an entity with a
> > > cluster-wide view that can negotiate a dynamic timeout, as actions
> > > don’t know about each other.
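> > >
> > > A rough sketch of the dynamic idea (the function and its parameters
> > > are made up here, not taken from the proposal):
> > >
> > > import scala.concurrent.duration._
> > >
> > > // Shrink the idle timeout as cluster memory utilization grows, so
> > > // idle containers are reclaimed faster under pressure and kept warm
> > > // longer when there is headroom.
> > > def idleTimeout(memoryUtilization: Double): FiniteDuration = {
> > >   val minMs = 50L
> > >   val maxMs = 10000L
> > >   val u = math.min(1.0, math.max(0.0, memoryUtilization))
> > >   (maxMs - ((maxMs - minMs) * u).toLong).milliseconds
> > > }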
> > >
> > > - dragos
> > >
> > > On Wed, Apr 10, 2019 at 3:46 AM Dominic Kim <st...@gmail.com> wrote:
> > >
> > > > Dear Dascalita,
> > > >
> > > > That depends on the timeout configuration.
> > > > For example, if you need something similar to the one in the current
> > > > code base, you can just configure the timeout to a small enough
> > > > value, such as 50ms.
> > > >
> > > > The idea behind the longer timeout is that it shows better
> > > > performance when subsequent requests are highly likely.
> > > > For example, it takes about 100ms ~ 1s to create a new cold-start
> > > > container.
> > > > If the action execution takes 10ms, it would wait 10 to 100 times
> > > > longer for a new container than for the execution itself.
> > > > In this case, it is reasonable to wait for the previous execution and
> > > > reuse the existing container rather than creating a new one.
> > > > So 100ms ~ 1s could be a good choice for the timeout value.
> > > > (Under heavy loads, I even observed that it took 2s ~ 5s to create a
> > > > cold-start container.)
> > > > And this implies some changes in the notion of resources.
> > > >
> > > > In the cluster, there would be different kinds of requests.
> > > > There would be both batch and real-time invocations.
> > > > So I think this is a tradeoff.
> > > > A longer timeout will increase the reuse rate of containers, but idle
> > > > containers will hold resources longer.
> > > >
> > > > And even in the current implementation, a subsequent invocation may
> > > > have to wait for some time while existing (warmed) containers are
> > > > removed and a new cold-start container is created.
> > > > As I said, it could take up to a few seconds under heavy loads.
> > > > With a reasonable timeout value, there would be no big performance
> > > > difference in the above situation.
> > > > (Actually, I expect the new scheduler to outperform even with a 5~10s
> > > > timeout value, as it will evenly distribute Docker operations.
> > > > In the current implementation, all executions are sent to the home
> > > > invoker first, which could make the situation worse in edge cases.
> > > > I hope I can share performance comparison results as I am doing
> > > > benchmarking.)
> > > >
> > > > Also, I think the above case is an edge case where someone is
> > > > consuming most of the cluster resources and executing two different
> > > > batch invocations alternately.
> > > > Anyway, we can support such an edge case with some shutdown period.
> > > > This can be controversial, but I believe this is a viable option.
> > > >
> > > >
> > > > If you said that from the view of an OpenWhisk operator, I think you
> > > > meant there is more than one heavy user.
> > > > Let's say one user has a 60-container limit and the other has an
> > > > 80-container limit.
> > > > Then can we guarantee both users' executions without any issue in the
> > > > current implementation?
> > > > If their invocation requests come together, one or both of their
> > > > invocations will be heavily delayed.
> > > >
> > > > So I think when we (operators) notice there are such heavy users, we
> > > > should scale out our clusters to guarantee their invocations, or we
> > > > should reduce their resource limits.
> > > > This is also a tradeoff. If we must guarantee their invocations, we
> > > > at least need a cluster bigger than the sum of their throttling
> > > > limits.
> > > > If we have a weak SLA, we can support both users with a smaller
> > > > cluster, though their invocations may be delayed a bit.
> > > >
> > > >
> > > > In short, if you prefer the current behavior, you can still get a
> > > > similar effect by configuring the timeout as 50ms.
> > > > (So containers will only wait for 50ms, though it may induce some
> > > > performance degradation in other cases.)
> > > >
> > > > Best regards
> > > > Dominic
> > > >
> > > >
> > > > On Wed, Apr 10, 2019 at 1:36 AM Dascalita Dragos <dd...@gmail.com> wrote:
> > > >
> > > > > "...When there is no more activation message, ContainerProxy will
> be
> > > wait
> > > > > for the given time(configurable) and just stop...."
> > > > >
> > > > > How does the system allocate and de-allocate resources when it's
> > > > congested
> > > > > ?
> > > > > I'm thinking of the use case where the system receives a batch of
> > > > > activations that requires 60% of all cluster resources. Once those
> > > > > activations finish, a different batch of activations is received,
> > > > > and this time the new batch requires new actions to be
> > > > > cold-started; these new activations require a total of 80% of the
> > > > > overall cluster resources. Unless the previous actions are removed,
> > > > > the cluster is over-allocated. In the current model, would the
> > > > > cluster process 1/2 of the new activations b/c it needs to wait for
> > > > > the previous actions to stop by themselves?
> > > > >
> > > > > On Sun, Apr 7, 2019 at 7:34 PM Dominic Kim <st...@gmail.com> wrote:
> > > > >
> > > > > > Hi Mingyu,
> > > > > >
> > > > > > Thank you for the good questions.
> > > > > >
> > > > > > Before answering your questions, let me first explain the Lease
> > > > > > in ETCD.
> > > > > > ETCD has a data model, the Lease: data attached to a lease
> > > > > > disappears after a given time if there is no relevant keep-alive
> > > > > > on it.
> > > > > >
> > > > > > So once you grant a new lease, you can attach it to the data in
> > > > > > each operation such as put, putTxn (transaction), etc.
> > > > > > If there is no keep-alive within the given (configurable) time,
> > > > > > the inserted data will be gone.
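> > > > > >
> > > > > > For illustration, a minimal sketch with the jetcd Java client
> > > > > > (the endpoint, TTL, and key names are made up for the example):
> > > > > >
> > > > > > import java.nio.charset.StandardCharsets.UTF_8
> > > > > > import io.etcd.jetcd.{ByteSequence, Client}
> > > > > > import io.etcd.jetcd.options.PutOption
> > > > > >
> > > > > > val client = Client.builder().endpoints("http://127.0.0.1:2379").build()
> > > > > >
> > > > > > // Grant a lease with a 10s TTL and put a key bound to it.
> > > > > > val leaseId = client.getLeaseClient.grant(10L).get().getID
> > > > > > val key = ByteSequence.from("scheduler/0/endpoint", UTF_8)
> > > > > > val value = ByteSequence.from("10.0.0.1:8080", UTF_8)
> > > > > > client.getKVClient
> > > > > >   .put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build())
> > > > > >   .get()
> > > > > >
> > > > > > // Each keep-alive renews the TTL; if the owner crashes and
> > > > > > // keep-alives stop, ETCD deletes the key once the lease expires.
> > > > > > client.getLeaseClient.keepAliveOnce(leaseId).get()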
> > > > > >
> > > > > > In my proposal, most of the data in ETCD relies on a lease.
> > > > > > For example, each scheduler stores its endpoint information (for
> > > > > > queue creation) with a lease. Each queue stores its information
> > > > > > (for activation) in ETCD with a lease.
> > > > > > (As it would be overhead to do keep-alives in each memory queue
> > > > > > separately, I introduced EtcdKeepAliveService to share one global
> > > > > > lease among all queues in the same scheduler.)
> > > > > > Each ContainerProxy stores its information in ETCD with a lease,
> > > > > > so that when a queue tries to create a container, it can easily
> > > > > > count the number of existing containers with the "Count" API.
> > > > > > Since both kinds of data are stored with a lease, if a scheduler
> > > > > > or an invoker fails, the keep-alives for the given lease stop,
> > > > > > and finally the data will be removed.
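> > > > > >
> > > > > > Continuing the sketch above, the container count is just a prefix
> > > > > > count in ETCD (the key layout is illustrative only):
> > > > > >
> > > > > > import io.etcd.jetcd.options.GetOption
> > > > > >
> > > > > > // Count the keys under the action's container prefix without
> > > > > > // fetching the values themselves.
> > > > > > val prefix = ByteSequence.from("namespace/guest/hello/container/", UTF_8)
> > > > > > val existing = client.getKVClient
> > > > > >   .get(prefix, GetOption.newBuilder().withPrefix(prefix).withCountOnly(true).build())
> > > > > >   .get()
> > > > > >   .getCount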
> > > > > >
> > > > > > Follower queues are watching the leader queue information. If
> > > > > > there are any changes (the data will be removed upon scheduler
> > > > > > failure), they receive a notification and start a new leader
> > > > > > election.
> > > > > > When a scheduler fails, the ContainerProxys which were
> > > > > > communicating with a queue in that scheduler will receive a
> > > > > > connection error.
> > > > > > Then they are designed to access ETCD again to figure out the
> > > > > > endpoint of the leader queue.
> > > > > > As one of the followers becomes the new leader, the
> > > > > > ContainerProxys can connect to the new leader.
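> > > > > >
> > > > > > The watch side could look roughly like this (the leader key and
> > > > > > the election hook are hypothetical):
> > > > > >
> > > > > > import io.etcd.jetcd.Watch
> > > > > > import io.etcd.jetcd.watch.{WatchEvent, WatchResponse}
> > > > > >
> > > > > > def startLeaderElection(): Unit = ??? // follower-specific logic
> > > > > >
> > > > > > // A DELETE event on the leader key means the leader's lease
> > > > > > // expired (i.e. the scheduler died), so a follower starts a new
> > > > > > // leader election.
> > > > > > val leaderKey = ByteSequence.from("queue/guest/hello/leader", UTF_8)
> > > > > > val listener = Watch.listener((response: WatchResponse) =>
> > > > > >   response.getEvents.forEach { event =>
> > > > > >     if (event.getEventType == WatchEvent.EventType.DELETE)
> > > > > >       startLeaderElection()
> > > > > >   })
> > > > > > client.getWatchClient.watch(leaderKey, listener)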
> > > > > >
> > > > > > One thing to note here is that there is only one QueueManager in
> > > > > > each scheduler.
> > > > > > One QueueManager holds all queues and delegates requests to the
> > > > > > proper queue in response to "fetch" requests.
> > > > > >
> > > > > > In short, all endpoint data is stored in ETCD and renewed based
> > > > > > on keep-alives and leases.
> > > > > > Each component is designed to access ETCD when a failure is
> > > > > > detected and connect to the new (failed-over) scheduler.
> > > > > >
> > > > > > I hope this is useful to you.
> > > > > > And I think when my colleagues and I open PRs, we need to add a
> > > > > > more detailed design along with them.
> > > > > >
> > > > > > If you have any further questions, kindly let me know.
> > > > > >
> > > > > > Thanks
> > > > > > Best regards
> > > > > > Dominic
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Apr 6, 2019 at 11:28 AM Mingyu Zhou <zh...@gmail.com> wrote:
> > > > > >
> > > > > > > Dear Dominic,
> > > > > > >
> > > > > > > Thanks for your proposal. It is very inspirational and it
> > > > > > > looks promising.
> > > > > > >
> > > > > > > I'd like to ask some questions about the failover/failure
> > > > > > > recovery mechanism of the scheduler component.
> > > > > > >
> > > > > > > IIUC, a scheduler instance hosts multiple queue managers. If a
> > > > > > > scheduler is down, we will lose multiple queue managers. Thus,
> > > > > > > there should be some form of failure recovery for queue
> > > > > > > managers, and it raises the following questions:
> > > > > > >
> > > > > > > 1. In your proposal, how is the failure of a scheduler
> > > > > > > detected? I.e., when a scheduler instance is down and some
> > > > > > > queue managers become unreachable, which component will be
> > > > > > > aware of this unavailability and then trigger the recovery
> > > > > > > procedure?
> > > > > > >
> > > > > > > 2. How should the failure be recovered and lost queue managers
> > > > > > > be brought back to life? Specifically, in your proposal, you
> > > > > > > designed a hot stand-by pairing of queue managers (one
> > > > > > > leader/two followers). Then how should the new leader be
> > > > > > > selected in the face of a scheduler crash? And do we need to
> > > > > > > allocate a new queue manager to maintain the
> > > > > > > one-leader-two-follower configuration?
> > > > > > >
> > > > > > > 3. How will the other components in the system learn the new
> > > > > > > configuration after a failover? For example, how will the pool
> > > > > > > balancer discover the new state of the scheduler it manages and
> > > > > > > change its policy to distribute queue creation requests?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Mingyu Zhou
> > > > > > >
> > > > > > > On Fri, Apr 5, 2019 at 2:56 PM Dominic Kim <style9595@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Dear David, Matt, and Dascalita,
> > > > > > > > Thank you for your interest in my proposal.
> > > > > > > >
> > > > > > > > Let me answer your questions one by one.
> > > > > > > >
> > > > > > > > @David
> > > > > > > > Yes, I will (and actually already did) implement all
> > > > > > > > components based on SPIs.
> > > > > > > > The reason why I said "breaking changes" is that my proposal
> > > > > > > > will affect most of the components drastically.
> > > > > > > > For example, InvokerReactive will become an SPI, and the
> > > > > > > > current InvokerReactive will become one of its concrete
> > > > > > > > implementations.
> > > > > > > > My load balancer and throttler are also based on the current
> > > > > > > > SPIs.
> > > > > > > > So though my implementation would be included in OpenWhisk,
> > > > > > > > downstreams can still take advantage of existing
> > > > > > > > implementations such as ShardingPoolBalancer.
> > > > > > > >
> > > > > > > > Regarding Leader/Follower, that's a fair point.
> > > > > > > > The reason why I introduced such a model is to prepare for
> > > > > > > > future enhancements.
> > > > > > > > Actually, I reached the conclusion that memory-based
> > > > > > > > activation passing would be enough for OpenWhisk in terms of
> > > > > > > > message persistence.
> > > > > > > > But that is just my own opinion, and the community may want a
> > > > > > > > more rigid level of persistence.
> > > > > > > > I naively thought we could add replication and HA logic to
> > > > > > > > the queue, similar to the ones in Kafka.
> > > > > > > > And Leader/Follower would be a good base building block for
> > > > > > > > this.
> > > > > > > >
> > > > > > > > Currently, only the leader fetches activation messages from
> > > > > > > > Kafka. Followers will be idle while watching for a
> > > > > > > > leadership change.
> > > > > > > > Once the leadership changes, one of the followers will become
> > > > > > > > the new leader, and at that time a Kafka consumer for the new
> > > > > > > > leader will be created.
> > > > > > > > This is to minimize the failure handling time from the
> > > > > > > > clients' perspective, as you mentioned. It is also correct
> > > > > > > > that this flow does not prevent activation messages from
> > > > > > > > being lost on scheduler failure.
> > > > > > > > But it's not that complex, as activation messages are not
> > > > > > > > replicated to followers and the number of followers is
> > > > > > > > configurable.
> > > > > > > > If we want, we can configure the number of required queues to
> > > > > > > > 1; then there will be only one leader queue.
> > > > > > > > (If we are OK with the current level of persistence, I think
> > > > > > > > we may not need more than 1 queue, as you said.)
> > > > > > > >
> > > > > > > > Regarding pulling activation messages, each action will have
> > > > > > > > its own Kafka topic.
> > > > > > > > It is the same as what I proposed in my previous proposals.
> > > > > > > > When an action is created, a Kafka topic for the action will
> > > > > > > > be created.
> > > > > > > > So each leader queue (consumer) will fetch activation
> > > > > > > > messages from its own Kafka topic, and there would be no
> > > > > > > > interference among actions.
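> > > > > > > >
> > > > > > > > A sketch of the per-action consumer idea (the topic naming
> > > > > > > > and configuration here are illustrative, not the project's
> > > > > > > > actual conventions):
> > > > > > > >
> > > > > > > > import java.time.Duration
> > > > > > > > import java.util.{Collections, Properties}
> > > > > > > > import org.apache.kafka.clients.consumer.KafkaConsumer
> > > > > > > >
> > > > > > > > val props = new Properties()
> > > > > > > > props.put("bootstrap.servers", "localhost:9092")
> > > > > > > > props.put("group.id", "queue-guest-hello")
> > > > > > > > props.put("key.deserializer",
> > > > > > > >   "org.apache.kafka.common.serialization.StringDeserializer")
> > > > > > > > props.put("value.deserializer",
> > > > > > > >   "org.apache.kafka.common.serialization.StringDeserializer")
> > > > > > > >
> > > > > > > > // Only the leader queue creates a consumer, and only for its
> > > > > > > > // own action topic, so actions do not interfere with each
> > > > > > > > // other.
> > > > > > > > val consumer = new KafkaConsumer[String, String](props)
> > > > > > > > consumer.subscribe(Collections.singletonList("guest-hello"))
> > > > > > > > val records = consumer.poll(Duration.ofMillis(100))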
> > > > > > > >
> > > > > > > > When my colleagues and I open PRs for each component, we
> > > > > > > > will add detailed component designs.
> > > > > > > > That should help you all understand the proposal better.
> > > > > > > >
> > > > > > > > @Matt
> > > > > > > > Thank you for the suggestion.
> > > > > > > > If I changed the name now, it would break the link in this
> > > > > > > > thread.
> > > > > > > > I will use the name you suggested when I open a PR.
> > > > > > > >
> > > > > > > >
> > > > > > > > @Dascalita
> > > > > > > >
> > > > > > > > Interesting idea.
> > > > > > > > Do you have any GC patterns in mind to apply to OpenWhisk?
> > > > > > > >
> > > > > > > > In my proposal, the container GC is similar to what OpenWhisk
> > > > > > > > does these days.
> > > > > > > > Each container will autonomously fetch activations from the
> > > > > > > > queue.
> > > > > > > > Whenever they finish the invocation of one activation, they
> > > > > > > > will fetch the next request and invoke it.
> > > > > > > > In this sense, we can maximize container reuse.
> > > > > > > >
> > > > > > > > When there is no more activation message, ContainerProxy will
> > > > > > > > wait for the given time (configurable) and just stop.
> > > > > > > > One difference is that containers are no longer paused; they
> > > > > > > > are just removed.
> > > > > > > > Instead of pausing them, containers wait for subsequent
> > > > > > > > requests for a longer time (5~10s) than in the current
> > > > > > > > implementation.
> > > > > > > > This is because pausing is also a relatively expensive
> > > > > > > > operation in comparison to a short-running invocation.
> > > > > > > >
> > > > > > > > The container lifecycle is managed in this way (see the
> > > > > > > > sketch after this list):
> > > > > > > > 1. When a container is created, it will add its information
> > > > > > > > to ETCD.
> > > > > > > > 2. A queue will count the existing number of containers using
> > > > > > > > the above information.
> > > > > > > > 3. Under heavy loads, the queue will request more containers
> > > > > > > > if the number of existing containers is less than its
> > > > > > > > resource limit.
> > > > > > > > 4. When the container is removed, it will delete its
> > > > > > > > information from ETCD.
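> > > > > > > >
> > > > > > > > A minimal sketch of that lifecycle (all names here are
> > > > > > > > hypothetical; this only illustrates the flow, not the actual
> > > > > > > > implementation):
> > > > > > > >
> > > > > > > > import scala.concurrent.duration._
> > > > > > > >
> > > > > > > > trait Activation
> > > > > > > > trait Queue { def fetch(wait: FiniteDuration): Option[Activation] }
> > > > > > > >
> > > > > > > > def registerInEtcd(): Unit = ()     // step 1: announce the container
> > > > > > > > def deregisterFromEtcd(): Unit = () // step 4: remove the record
> > > > > > > > def invoke(a: Activation): Unit = ()
> > > > > > > >
> > > > > > > > // Fetch-and-invoke until no activation arrives within the
> > > > > > > > // idle timeout, then stop; there is no paused state, the
> > > > > > > > // container is simply removed.
> > > > > > > > def containerLoop(queue: Queue, idle: FiniteDuration = 10.seconds): Unit = {
> > > > > > > >   registerInEtcd()
> > > > > > > >   try {
> > > > > > > >     Iterator
> > > > > > > >       .continually(queue.fetch(idle))
> > > > > > > >       .takeWhile(_.isDefined)
> > > > > > > >       .foreach(_.foreach(invoke))
> > > > > > > >   } finally deregisterFromEtcd()
> > > > > > > > }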
> > > > > > > >
> > > > > > > >
> > > > > > > > Again, I really appreciate all your feedback and questions.
> > > > > > > > If you have any further questions, kindly let me know.
> > > > > > > >
> > > > > > > > Best regards
> > > > > > > > Dominic
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Apr 5, 2019 at 1:24 AM Dascalita Dragos
> > > > > > > > <ddragosd@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Dominic,
> > > > > > > > > Thanks for sharing your ideas. IIUC (and pls keep me
> > > > > > > > > honest), the goal of the new design is to improve
> > > > > > > > > activation performance. I personally love this; performance
> > > > > > > > > is a critical non-functional feature of any FaaS system.
> > > > > > > > >
> > > > > > > > > There’s something I’d like to call out: the management of
> > > > > > > > > containers in a FaaS system could be compared to a JVM. A
> > > > > > > > > JVM allocates objects in memory and GCs them. A FaaS system
> > > > > > > > > allocates containers to run actions, and it GCs them when
> > > > > > > > > they become idle. If we could look at OW's scheduling from
> > > > > > > > > this perspective, we could reuse the proven patterns in the
> > > > > > > > > JVM vs inventing something new. I’d be interested in any GC
> > > > > > > > > implications in the new design, meaning how idle actions
> > > > > > > > > get removed, and how that is orchestrated.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > dragos
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Apr 4, 2019 at 8:40 AM Matt Sicker <boards@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Would it make sense to define an OpenWhisk
> > > > > > > > > > Improvement/Enhancement Proposal or something similar to
> > > > > > > > > > what various other Apache projects do? We could call them
> > > > > > > > > > WHIPs or something. :)
> > > > > > > > > >
> > > > > > > > > > On Thu, 4 Apr 2019 at 09:09, David P Grove <groved@us.ibm.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Dominic Kim <st...@gmail.com> wrote on 04/04/2019 04:37:19 AM:
> > > > > > > > > > > >
> > > > > > > > > > > I have proposed a new architecture.
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal
> > > > > > > > > > >
> > > > > > > > > > > It includes many controversial agenda items and
> > > > > > > > > > > breaking changes.
> > > > > > > > > > > So I would like to form a general consensus on them.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hi Dominic,
> > > > > > > > > > >
> > > > > > > > > > >         There's much to like about the proposal.  Thank
> > > > > > > > > > > you for writing it up.
> > > > > > > > > > >
> > > > > > > > > > >         One meta-comment is that the work will have to
> > > > > > > > > > > be done in a way so there are no actual "breaking
> > > > > > > > > > > changes".  It has to be possible to continue to
> > > > > > > > > > > configure the system using the existing architectures
> > > > > > > > > > > while this work proceeds.  I would expect this could be
> > > > > > > > > > > done via a new LoadBalancer and some deployment options
> > > > > > > > > > > (similar to how Lean OpenWhisk was done).  If work
> > > > > > > > > > > needs to be done to generalize the LoadBalancer SPI,
> > > > > > > > > > > that could be done early in the process.
> > > > > > > > > > >
> > > > > > > > > > >         On the proposal itself, I wonder if the
> > > > > > > > > > > complexity of Leader/Follower is actually needed?  If a
> > > > > > > > > > > Scheduler crashes, it could be restarted and then
> > > > > > > > > > > resume handling its assigned load.  I think there
> > > > > > > > > > > should be enough information in etcd for it to recover
> > > > > > > > > > > its current set of assigned ContainerProxys and carry
> > > > > > > > > > > on.  Activations in its in-memory queues would be lost
> > > > > > > > > > > (a bigger blast radius than the current architecture),
> > > > > > > > > > > but I don't see that the Leader/Follower changes that
> > > > > > > > > > > (it seems way too expensive to be replicating every
> > > > > > > > > > > activation in the Follower Queues).  The
> > > > > > > > > > > Leader/Follower would allow for shorter downtime for
> > > > > > > > > > > those actions assigned to the downed Scheduler, but at
> > > > > > > > > > > the cost of significant complexity.  Is it worth it?
> > > > > > > > > > >
> > > > > > > > > > >         Perhaps related to the Leader/Follower, it's
> > > > > > > > > > > not clear to me how activation messages are being
> > > > > > > > > > > pulled from the action topic in Kafka during the Queue
> > > > > > > > > > > creation window. I think they have to go somewhere
> > > > > > > > > > > (because there is a mix of actions on a single Kafka
> > > > > > > > > > > topic and we can't stall other actions while waiting
> > > > > > > > > > > for a Queue to be created for a new action), but if you
> > > > > > > > > > > don't know yet which Scheduler is going to win the race
> > > > > > > > > > > to be a Leader, how do you know where to put them?
> > > > > > > > > > >
> > > > > > > > > > > --dave
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Matt Sicker <bo...@gmail.com>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > 周明宇
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: New architecture proposal

Posted by Dascalita Dragos <dd...@gmail.com>.
"...I think GC implementation is orthogonal to my proposal.
So if you can deal with such problems, I would gladly apply it as well..."
I think it's definitely worth trying. The "mark-and-sweep" pattern with an
eventual consistent state may give GC a good premise to make decisions.
If GC is instantiated as a singleton action, through Akka, it can achieve a
highly fault tolerant service, allowing for self-healing with no single
point of failure.

If an action changes state every "2ms" as in the example above, and the GC
gets that change after "100ms", this could work fine; a timeout of "10mins"
may give GC a decent time window to make decisions. The "mark-and-sweep"
then allows containers ( ContainerProxy ) to move that container in the
"survival" space - to use GC G1 language - in case the container processed
an activation in between "mark" and "sweep" commands. If a container
"survived", then GC will prolong its TTL with another 10mins, and so
forth.  At the same time GC doesn't need to know about *all* state changes
(busy->warm->etc), especially if they happen so frequently. This generates
unnecessary noise in the network. GC could use a sampling strategy instead,
or check for "last activation" *only* when the allocated memory in the
cluster is over a threshold; there are a number of strategies that could
ensure GC runs with "minimum enough info" and "as infrequent as needed" to
make the right decisions.

"...If we "must" guarantee the execution of users, we need to have a bigger
cluster than the sum of users' limit....

I agree. For the cases when an operator can't run such a large cluster to
accommodate all namespaces regardless of their traffic, that's when a smart
GC becomes critical to achieve cost efficiency. At least in the deployments
I'm involved in, the cluster aims to shut-down most of its VMs when there's
no traffic, and scale back up as quickly as possible as traffic increases.
We run clusters in public clouds; we can't afford to keep thousands of VMs
running if nobody uses them. When cluster scales up, there's a time window
when there could be a congestion of resources; so, as with network
congestion, activations should suffer for a little while, but at least they
all get a fair chance to execute. They execute slower, but at least they
execute; but if containers are left to destroy themselves, then it's hard
to predict cluster behavior when congested, and it's hard to guarantee an
SLA. This is where I was coming from.

I'm sure we'll find a way to accommodate multiple deployment patterns. I'm
describing my setup with the hope that the new architecture will allow a
configuration mechanism or SPIs for such key areas.

On Thu, Apr 11, 2019 at 8:15 PM Dominic Kim <st...@gmail.com> wrote:

> Well, that is not the tradeoff only resides in my proposal.
> If we "must" guarantee the execution of users, we need to have a bigger
> cluster than the sum of users' limit.
> Even though we use current implementation, if the sum of concurrent limit
> exceeds the system resources, we cannot completely guarantee all executions
> under a burst.
> (We cannot guarantee 200 concurrent execution with 100 containers.)
>
> The reason why I used a kind of self-GC is, it's not easy to track the
> resource status in real time.
> It is the same with the reason why I take the pull-based model.
>
> For example, if we control the container deletion in a central way, we
> should track all container status such as how many containers are running
> for each action, which of them are idle, where they are running, and so on.
> But the status of resources changes blazingly fast.
> Below is one example. The execution is over within 2 ms.
> {
>     "namespace": "style95",
>     "name": "hello-world",
>     "version": "0.0.1",
>     "subject": "style95",
>     "activationId": "ba2cc561fc8e4272acc561fc8ea27210",
>     "start": 1554967214351,
>     "end": 1554967214353,
>     "duration": 2,
>     "response": {
>         "status": "success",
>         "statusCode": 0,
> .
> .
> .
> }
>
> It means the container status(busy, warm) can change every 2 ms.
>
> Also, if you are not thinking of one central component which decides to
> delete containers,(it can be a SPOF) there will be multiple components
> which are making the decision.
> When one of them makes a decision, it should consider decisions made(and
> will be made) by the others at the same time.
> So we need to track down all container status, consider all decision made
> by multiple components and finally make the optimal decision to delete
> containers within 2 ms.
> I think this is not viable.
>
> Your idea sounds great, it could be a great enhancement.
> And I think GC implementation is orthogonal to my proposal.
> So if you can deal with such problems, I would gladly apply it as well.
>
>
> Best regards
> Dominic
>
>
>
>
>
>
> 2019년 4월 11일 (목) 오후 3:15, Dascalita Dragos <dd...@gmail.com>님이 작성:
>
> > Thanks Dominic for the details.
> >
> > It seems like an operator has to choose between “do I hurt
> performance(low
> > timeout) or do I hurt the SLA” ?
> >
> > If this is the trade off , isn’t this a hard choice to make ? So I’m
> > wondering whether some alternative designs could be used for this
> problem.
> >
> > The key decision here is: should OW be given a cluster wide power to view
> > and control the resources or not. IIUC the current proposal doesn’t
> support
> > this? I’m not saying the proposed model is not good; I’d just feel more
> > comfortable if OW would allow more options instead of one, in the same
> way
> > the JVM allows multiple GC implementations. In the proposed model the GC
> > would offload the decision to each container, while other implementations
> > may do it differently. For instance,  I’d implement something dynamic
> that
> > adapts the timeout to the load, and maybe try some predictive ML
> algorithms
> > to manage resources - if a model suggests that out of 3 actions that
> could
> > be removed, 1 has a higher probability to be invoked again, wouldn’t it
> be
> > more efficient to remove one of the other 2 ? Such a logic can only be
> > achieved through an entity with a cluster wide view, as actions don’t
> know
> > about each other, to negotiate a dynamic timeout.
> >
> > - dragos
> >
> > On Wed, Apr 10, 2019 at 3:46 AM Dominic Kim <st...@gmail.com> wrote:
> >
> > > Dear Dascalita
> > >
> > > That depends on the timeout configuration.
> > > For example, if you need something similar to the one in the current
> code
> > > base, you can just configure the timeout to a small enough value, such
> as
> > > 50ms.
> > >
> > > The idea behind the longer timeout is, it shows better performance when
> > > there are highly likely subsequent requests.
> > > For example, it takes about 100ms ~ 1s to create a new coldstart
> > container.
> > > If the action execution takes 10ms, it should wait 10 to 100 times more
> > for
> > > a new container.
> > > In this case, it is reasonable to wait for the previous execution and
> > reuse
> > > the existing container rather than creating a new container.
> > > So 100ms ~ 1s could be good options for the timeout value.
> > > (Under heavy loads, I even observed it took 2s ~ 5s to create a
> coldstart
> > > container.)
> > > And this implies some changes in the notion of resources.
> > >
> > > In the cluster, there would be a different kind of requests.
> > > There would be both batch and real-time invocation.
> > > So I think this is a tradeoff.
> > > Longer timeout will increase the reuse rate of containers but idle
> > > containers will possess resources longer.
> > >
> > > And even in the current implementation, subsequent invocation should
> wait
> > > for some time to remove existing(warmed containers) and create a new
> cold
> > > start container.
> > > As I said, it could take up to few seconds under heavy loads.
> > > With reasonable timeout value, there would be no big performance
> > difference
> > > in the above situation.
> > > (Actually, I expect new scheduler would outperform even with 5~10s
> > timeout
> > > value as it will evenly distribute docker operation.
> > > In the current implementation, all execution is sent to the home
> invoker
> > > first and it could make the situation worse in edge cases.
> > > I hope I can share performance comparison results as I am doing
> > > benchmarking.)
> > >
> > > Also, I think the above case is an edge case that someone is consuming
> > most
> > > of the cluster resources and executing two different batch invocation
> > > alternatively.
> > > Anyway, we can support such an edge case with some shutdown period.
> > > This can be controversial, but I believe this is a viable option.
> > >
> > >
> > > If you said that in the view of OpenWhisk operator, I think you meant
> > there
> > > are more than 1 heavy users.
> > > Let's say, one user has 60 containers limit and the other has 80
> > containers
> > > limit.
> > > Then can we guarantee both users' execution without any issue in
> current
> > > implementation?
> > > If their invocation requests come together, one or both of their
> > invocation
> > > will be heavily delayed.
> > >
> > > So I think when we(operators) notice there are such heavy users, we
> > should
> > > scale out our clusters to guarantee their invocation or we should
> reduce
> > > their resource limit.
> > > This is also a tradeoff. If we must guarantee their invocation, we at
> > least
> > > need a bigger cluster than the sum of their throttling limit.
> > > If we have weak SLA, we can support both users with smaller cluster
> > though
> > > their invocation can be delayed a bit.
> > >
> > >
> > > In short, if you prefer the current behavior you can still have a
> similar
> > > effect by configuring the timeout as 50ms.
> > > (So containers will only wait for 50ms, though it may induce some
> > > performance degradation in other cases.)
> > >
> > > Best regards
> > > Dominic
> > >
> > >
> > > 2019년 4월 10일 (수) 오전 1:36, Dascalita Dragos <dd...@gmail.com>님이 작성:
> > >
> > > > "...When there is no more activation message, ContainerProxy will be
> > wait
> > > > for the given time(configurable) and just stop...."
> > > >
> > > > How does the system allocate and de-allocate resources when it's
> > > congested
> > > > ?
> > > > I'm thinking at the use case where the system receives a batch of
> > > > activations that require 60% of all cluster resources. Once those
> > > > activations finish, a different batch of activations are received,
> and
> > > this
> > > > time the new batch requires new actions to be cold-started; these new
> > > > activations require a total of 80% of the overall cluster resources.
> > > Unless
> > > > the previous actions are removed, the cluster is over-allocated. In
> the
> > > > current model would the cluster process 1/2 of the new activations
> b/c
> > it
> > > > needs to wait for the previous actions to stop by themselves ?
> > > >
> > > > On Sun, Apr 7, 2019 at 7:34 PM Dominic Kim <st...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Mingyu
> > > > >
> > > > > Thank you for the good questions.
> > > > >
> > > > > Before answering to your question, I will share the Lease in ETCD
> > > first.
> > > > > ETCD has a data model which is disappear after given time if there
> is
> > > no
> > > > > relevant keepalive on it, the Lease.
> > > > >
> > > > > So once you grant a new lease, you can put it with data in each
> > > operation
> > > > > such as put, putTxn(transaction), etc.
> > > > > If there is no keep-alive for the given(configurable) time,
> inserted
> > > data
> > > > > will be gone.
> > > > >
> > > > > In my proposal, most of data in ETCD rely on a lease.
> > > > > For example, each scheduler stores their endpoint information(for
> > queue
> > > > > creation) with a lease. Each queue stores their information(for
> > > > activation)
> > > > > in ETCD with a lease.
> > > > > (It is overhead to do keep-alive in each memory queue separately, I
> > > > > introduced EtcdKeepAliveService to share one global lease among all
> > > > queues
> > > > > in a same scheduler.)
> > > > > Each ContainerProxy store their information in ETCD with a lease so
> > > that
> > > > > when a queue tries to create a container, they can easily count the
> > > > number
> > > > > of existing containers with "Count" API.
> > > > > Both data are stored with a lease, if one scheduler or invoker are
> > > > failed,
> > > > > keep-alive for the given lease is not continued, and finally those
> > data
> > > > > will be removed.
> > > > >
> > > > > Follower queues are watching on the leader queue information. If
> > there
> > > > are
> > > > > any changes,(the data will be removed upon scheduler failure) they
> > can
> > > > > receive the notification and start new leader election.
> > > > > When a scheduler is failed, ContainerProxys which were
> communicating
> > > > with a
> > > > > queue in that scheduler, will receive a connection error.
> > > > > Then they are designed to access to the ETCD again to figure out
> the
> > > > > endpoint of the leader queue.
> > > > > As one of followers becomes a new leader, ContainerProxys can
> connect
> > > to
> > > > > the new leader.
> > > > >
> > > > > One thing to note here is, there is only one QueueManager in each
> > > > > scheduler.
> > > > > One QueueManager holds all queues and delegate requests to the
> proper
> > > > queue
> > > > > in respond to "fetch" requests.
> > > > >
> > > > > In short, all endpoints data are stored in ETCD and they are
> renewed
> > > > based
> > > > > on keep-alive and lease.
> > > > > Each components are designed to access ETCD when the failure
> detected
> > > and
> > > > > connect to a new(failed-over) scheduler.
> > > > >
> > > > > I hope it is useful to you.
> > > > > And I think when I and my colleagues open PRs, we need to add more
> > > detail
> > > > > design along with them.
> > > > >
> > > > > If you have any further questions, kindly let me know.
> > > > >
> > > > > Thanks
> > > > > Best regards
> > > > > Dominic
> > > > >
> > > > >
> > > > >
> > > > > 2019년 4월 6일 (토) 오전 11:28, Mingyu Zhou <zh...@gmail.com>님이 작성:
> > > > >
> > > > > > Dear Dominic,
> > > > > >
> > > > > > Thanks for your proposal. It is very inspirational and it looks
> > > > > promising.
> > > > > >
> > > > > > I'd like to ask some questions about the fall over/failure
> recovery
> > > > > > mechanism of the scheduler component.
> > > > > >
> > > > > > IIUC, a scheduler instance hosts multiple queue managers. If a
> > > > scheduler
> > > > > is
> > > > > > down, we will lose multiple queue managers. Thus, there should be
> > > some
> > > > > form
> > > > > > of failure recovery of queue managers and it raises the following
> > > > > > questions:
> > > > > >
> > > > > > 1. In your proposal, how the failure of a scheduler is detected?
> > > I.e.,
> > > > > > when a scheduler instance is down and some queue manager become
> > > > > > unreachable, which component will be aware of this unavailability
> > and
> > > > > then
> > > > > > trigger the recovery procedure?
> > > > > >
> > > > > > 2. How should the failure be recovered and lost queue managers be
> > > > brought
> > > > > > back to life? Specifically, in your proposal, you designed a hot
> > > > > > standing-by pairing of queue managers (one leader/two followers).
> > > Then
> > > > > how
> > > > > > should the new leader be selected in face of scheduler crash? And
> > do
> > > we
> > > > > > need to allocate a new queue manager to maintain the
> > > > > > one-leader-two-follower configuration?
> > > > > >
> > > > > > 3. How will the other components in the system learn the new
> > > > > configuration
> > > > > > after a fall over? For example, how will the pool balancer
> discover
> > > the
> > > > > new
> > > > > > state of the scheduler it managers and change its policy to
> > > distribute
> > > > > > queue creation requests?
> > > > > >
> > > > > > Thanks
> > > > > > Mingyu Zhou
> > > > > >
> > > > > > On Fri, Apr 5, 2019 at 2:56 PM Dominic Kim <st...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Dear David, Matt, and Dascalita.
> > > > > > > Thank you for your interest in my proposal.
> > > > > > >
> > > > > > > Let me answer your questions one by one.
> > > > > > >
> > > > > > > @David
> > > > > > > Yes, I will(and actually already did) implement all components
> > > based
> > > > on
> > > > > > > SPI.
> > > > > > > The reason why I said "breaking changes" is, my proposal will
> > > affect
> > > > > most
> > > > > > > of components drastically.
> > > > > > > For example, InvokerReactive will become a SPI and current
> > > > > > InvokerReactive
> > > > > > > will become one of its concrete implementation.
> > > > > > > My load balancer and throttler are also based on the current
> SPI.
> > > > > > > So though my implementation would be included in OpenWhisk,
> > > > downstreams
> > > > > > > still can take advantage of existing implementation such as
> > > > > > > ShardingPoolBalancer.
> > > > > > >
> > > > > > > Regarding Leader/Follower, a fair point.
> > > > > > > The reason why I introduced such a model is to prepare for the
> > > future
> > > > > > > enhancement.
> > > > > > > Actually, I reached a conclusion that memory based activation
> > > passing
> > > > > > would
> > > > > > > be enough for OpenWhisk in terms of message persistence.
> > > > > > > But it is just my own opinion and community may want more rigid
> > > level
> > > > > of
> > > > > > > persistence.
> > > > > > > I naively thought we can add replication and HA logic in the
> > queue
> > > > > which
> > > > > > > are similar to the one in Kafka.
> > > > > > > And Leader/Follower would be a good base building block for
> this.
> > > > > > >
> > > > > > > Currently only a leader fetch activation messages from Kafka.
> > > > Followers
> > > > > > > will be idle while watching the leadership change.
> > > > > > > Once the leadership is changed, one of followers will become a
> > new
> > > > > leader
> > > > > > > and at that time, Kafka consumer for the new leader will be
> > > created.
> > > > > > > This is to minimize the failure handling time in the aspect of
> > > > clients
> > > > > as
> > > > > > > you mentioned. It is also correct that this flow does not
> prevent
> > > > > > > activation messages lost on scheduler failure.
> > > > > > > But it's not that complex as activation messages are not
> > replicated
> > > > to
> > > > > > > followers and the number of followers are configurable.
> > > > > > > If we want, we can configure the number of required queue to 1,
> > > there
> > > > > > will
> > > > > > > be only one leader queue.
> > > > > > > (If we ok with the current level of persistence, I think we may
> > not
> > > > > need
> > > > > > > more than 1 queue as you said.)
> > > > > > >
> > > > > > > Regarding pulling activation messages, each action will have
> its
> > > own
> > > > > > Kafka
> > > > > > > topic.
> > > > > > > It is same with what I proposed in my previous proposals.
> > > > > > > When an action is created, a Kafka topic for the action will be
> > > > > created.
> > > > > > > So each leader queue(consumer) will fetch activation messages
> > from
> > > > its
> > > > > > own
> > > > > > > Kafka topic and there would be no intervention among actions.
> > > > > > >
> > > > > > > When I and my colleagues open PRs for each component, we will
> add
> > > > > detail
> > > > > > > component design.
> > > > > > > It would help you guys understand the proposal more.
> > > > > > >
> > > > > > > @Matt
> > > > > > > Thank you for the suggestion.
> > > > > > > If I change the name of it now, it would break the link in this
> > > > thread.
> > > > > > > I would use the name you suggested when I open a PR.
> > > > > > >
> > > > > > >
> > > > > > > @Dascalita
> > > > > > >
> > > > > > > Interesting idea.
> > > > > > > Any GC patterns do you keep in your mind to apply in OpenWhisk?
> > > > > > >
> > > > > > > In my proposal, the container GC is similar to what OpenWhisk
> > does
> > > > > these
> > > > > > > days.
> > > > > > > Each container will autonomously fetch activations from the
> > queue.
> > > > > > > Whenever they finish invocation of one activation, they will
> > fetch
> > > > the
> > > > > > next
> > > > > > > request and invoke it.
> > > > > > > In this sense, we can maximize the container reuse.
> > > > > > >
> > > > > > > When there is no more activation message, ContainerProxy will
> be
> > > wait
> > > > > for
> > > > > > > the given time(configurable) and just stop.
> > > > > > > One difference is containers are no more paused, they are just
> > > > removed.
> > > > > > > Instead of pausing them, containers are waiting for subsequent
> > > > requests
> > > > > > for
> > > > > > > longer time(5~10s) than current implementation.
> > > > > > > This is because pausing is also relatively expensive operation
> in
> > > > > > > comparison to short-running invocation.
> > > > > > >
> > > > > > > Container lifecycle is managed in this way.
> > > > > > > 1. When a container is created, it will add its information in
> > > ETCD.
> > > > > > > 2. A queue will count the existing number of containers using
> > above
> > > > > > > information.
> > > > > > > 3. Under heavy loads, the queue will request more containers if
> > the
> > > > > > number
> > > > > > > of existing containers is less than its resource limit.
> > > > > > > 4. When the container is removed, it will delete its
> information
> > in
> > > > > ETCD.
> > > > > > >
> > > > > > >
> > > > > > > Again, I really appreciate all your feedbacks and questions.
> > > > > > > If you have any further questions, kindly let me know.
> > > > > > >
> > > > > > > Best regards
> > > > > > > Dominic
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 2019년 4월 5일 (금) 오전 1:24, Dascalita Dragos <ddragosd@gmail.com
> >님이
> > > 작성:
> > > > > > >
> > > > > > > > Hi Dominic,
> > > > > > > > Thanks for sharing your ideas. IIUC (and pls keep me honest),
> > the
> > > > > goal
> > > > > > of
> > > > > > > > the new design is to improve activation performance. I
> > personally
> > > > > love
> > > > > > > > this; performance is a critical non-functional feature of any
> > > FaaS
> > > > > > > system.
> > > > > > > >
> > > > > > > > There’s something I’d like to call out: the management of
> > > > containers
> > > > > > in a
> > > > > > > > FaaS system could be compared to a JVM. A JVM allocates
> objects
> > > in
> > > > > > > memory,
> > > > > > > > and GC them. A FaaS system allocates containers to run
> actions,
> > > and
> > > > > it
> > > > > > > GCs
> > > > > > > > them when they become idle. If we could look at OW's
> scheduling
> > > > from
> > > > > > this
> > > > > > > > perspective, we could reuse the proven patterns in the JVM vs
> > > > > inventing
> > > > > > > > something new. I’d be interested on any GC implications in
> the
> > > new
> > > > > > > design,
> > > > > > > > meaning how idle actions get removed, and how is that
> > > orchestrated.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > dragos
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Apr 4, 2019 at 8:40 AM Matt Sicker <boards@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Would it make sense to define an OpenWhisk
> > > > Improvement/Enhancement
> > > > > > > > > Propoposal or similar that various other Apache projects
> do?
> > We
> > > > > could
> > > > > > > > > call them WHIPs or something. :)
> > > > > > > > >
> > > > > > > > > On Thu, 4 Apr 2019 at 09:09, David P Grove <
> > groved@us.ibm.com>
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Dominic Kim <st...@gmail.com> wrote on 04/04/2019
> > > 04:37:19
> > > > > AM:
> > > > > > > > > > >
> > > > > > > > > > > I have proposed a new architecture.
> > > > > > > > > > >
> > > > > > > >
> > > > >
> > https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture
> > > > > > > > > > +proposal
> > > > > > > > > > >
> > > > > > > > > > > It includes many controversial agendas and breaking
> > > changes.
> > > > > > > > > > > So I would like to form a general consensus on them.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Dominic,
> > > > > > > > > >
> > > > > > > > > >         There's much to like about the proposal.  Thank
> you
> > > for
> > > > > > > writing
> > > > > > > > > it
> > > > > > > > > > up.
> > > > > > > > > >
> > > > > > > > > >         One meta-comment is that the work will have to be
> > > done
> > > > > in a
> > > > > > > way
> > > > > > > > > so
> > > > > > > > > > there are no actual "breaking changes".  It has to be
> > > possible
> > > > to
> > > > > > > > > continue
> > > > > > > > > > to configure the system using the existing architectures
> > > while
> > > > > this
> > > > > > > > work
> > > > > > > > > > proceeds.  I would expect this could be done via a new
> > > > > LoadBalancer
> > > > > > > and
> > > > > > > > > > some deployment options (similar to how Lean OpenWhisk
> was
> > > > done).
> > > > > > If
> > > > > > > > > work
> > > > > > > > > > needs to be done to generalize the LoadBalancer SPI, that
> > > could
> > > > > be
> > > > > > > done
> > > > > > > > > > early in the process.
> > > > > > > > > >
> > > > > > > > > >         On the proposal itself, I wonder if the
> complexity
> > of
> > > > > > > > > Leader/Follower
> > > > > > > > > > is actually needed?  If a Scheduler crashes, it could be
> > > > > restarted
> > > > > > > and
> > > > > > > > > then
> > > > > > > > > > resume handling its assigned load.  I think there should
> be
> > > > > enough
> > > > > > > > > > information in etcd for it to recover its current set of
> > > > assigned
> > > > > > > > > > ContainerProxys and carry on.   Activations in its in
> > memory
> > > > > queues
> > > > > > > > would
> > > > > > > > > > be lost (bigger blast radius than the current
> > architecture),
> > > > but
> > > > > I
> > > > > > > > don't
> > > > > > > > > > see that the Leader/Follower changes that (seems way too
> > > > > expensive
> > > > > > to
> > > > > > > > be
> > > > > > > > > > replicating every activation in the Follower Queues).
>  The
> > > > > > > > > Leader/Follower
> > > > > > > > > > would allow for shorter downtime for those actions
> assigned
> > > to
> > > > > the
> > > > > > > > downed
> > > > > > > > > > Scheduler, but at the cost of significant complexity.  Is
> > it
> > > > > worth
> > > > > > > it?
> > > > > > > > > >
> > > > > > > > > >         Perhaps related to the Leader/Follower, its not
> > clear
> > > > to
> > > > > me
> > > > > > > how
> > > > > > > > > > activation messages are being pulled from the action
> topic
> > in
> > > > > Kafka
> > > > > > > > > during
> > > > > > > > > > the Queue creation window. I think they have to go
> > somewhere
> > > > > > (because
> > > > > > > > the
> > > > > > > > > > is a mix of actions on a single Kafka topic and we can't
> > > stall
> > > > > > other
> > > > > > > > > > actions while waiting for a Queue to be created for a new
> > > > > action),
> > > > > > > but
> > > > > > > > if
> > > > > > > > > > you don't know yet which Scheduler is going to win the
> race
> > > to
> > > > > be a
> > > > > > > > > Leader
> > > > > > > > > > how do you know where to put them?
> > > > > > > > > >
> > > > > > > > > > --dave
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Matt Sicker <bo...@gmail.com>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > 周明宇
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: New architecture proposal

Posted by Dominic Kim <st...@gmail.com>.
Well, that is not a tradeoff that resides only in my proposal.
If we "must" guarantee users' executions, we need a cluster bigger than the
sum of the users' limits.
Even with the current implementation, if the sum of the concurrency limits
exceeds the system resources, we cannot completely guarantee all executions
under a burst.
(We cannot guarantee 200 concurrent executions with 100 containers.)

The reason why I used a kind of self-GC is that it is not easy to track the
resource status in real time.
This is also the reason why I chose the pull-based model.

For example, if we controlled container deletion in a central way, we would
have to track the status of every container: how many containers are running
for each action, which of them are idle, where they are running, and so on.
But the status of these resources changes blazingly fast.
Below is one example: the execution completes within 2 ms.
{
    "namespace": "style95",
    "name": "hello-world",
    "version": "0.0.1",
    "subject": "style95",
    "activationId": "ba2cc561fc8e4272acc561fc8ea27210",
    "start": 1554967214351,
    "end": 1554967214353,
    "duration": 2,
    "response": {
        "status": "success",
        "statusCode": 0,
.
.
.
}

This means a container's status (busy, warm) can change every 2 ms.

Also, unless there is one central component that decides to delete
containers (which could be a SPOF), there will be multiple components making
this decision.
When one of them makes a decision, it should take into account the decisions
made (and about to be made) by the others at the same time.
So we would need to track the status of every container, consider every
decision made by the other components, and finally make the optimal deletion
decision, all within 2 ms.
I think this is not viable.
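
To make the self-GC idea concrete, here is a minimal sketch of how each
ContainerProxy could time itself out autonomously, assuming an Akka actor
similar to the real one. The message names (Activation, FetchNext) and the
ETCD cleanup step are illustrative, not the actual implementation:

import akka.actor.{Actor, ActorLogging, PoisonPill, ReceiveTimeout}
import scala.concurrent.duration.FiniteDuration

// Illustrative messages; the real protocol lives in ContainerProxy.
final case class Activation(id: String)
case object FetchNext

class SelfGcContainerProxy(idleTimeout: FiniteDuration) extends Actor with ActorLogging {
  // No central GC: if nothing arrives within idleTimeout, the proxy
  // cleans up after itself. Only local state is needed for the decision.
  context.setReceiveTimeout(idleTimeout)

  def receive: Receive = {
    case Activation(id) =>
      log.info("invoking {}", id)
      // ... run the action, then pull the next message from the queue ...
      self ! FetchNext
    case FetchNext =>
      // ... fetch from the memory queue; any message resets the timeout ...
    case ReceiveTimeout =>
      // ... delete this container's key (and lease) in ETCD, then stop ...
      self ! PoisonPill
  }
}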

Your idea sounds great; it could be a valuable enhancement.
And I think the GC implementation is orthogonal to my proposal.
So if you can deal with such problems, I would gladly adopt it as well.


Best regards
Dominic

On Thu, Apr 11, 2019 at 3:15 PM, Dascalita Dragos <dd...@gmail.com> wrote:

> Thanks Dominic for the details.
>
> It seems like an operator has to choose between “do I hurt performance(low
> timeout) or do I hurt the SLA” ?
>
> If this is the trade off , isn’t this a hard choice to make ? So I’m
> wondering whether some alternative designs could be used for this problem.
>
> The key decision here is: should OW be given a cluster wide power to view
> and control the resources or not. IIUC the current proposal doesn’t support
> this? I’m not saying the proposed model is not good; I’d just feel more
> comfortable if OW would allow more options instead of one, in the same way
> the JVM allows multiple GC implementations. In the proposed model the GC
> would offload the decision to each container, while other implementations
> may do it differently. For instance,  I’d implement something dynamic that
> adapts the timeout to the load, and maybe try some predictive ML algorithms
> to manage resources - if a model suggests that out of 3 actions that could
> be removed, 1 has a higher probability to be invoked again, wouldn’t it be
> more efficient to remove one of the other 2 ? Such a logic can only be
> achieved through an entity with a cluster wide view, as actions don’t know
> about each other, to negotiate a dynamic timeout.
>
> - dragos
>
> On Wed, Apr 10, 2019 at 3:46 AM Dominic Kim <st...@gmail.com> wrote:
>
> > Dear Dascalita
> >
> > That depends on the timeout configuration.
> > For example, if you need something similar to the one in the current code
> > base, you can just configure the timeout to a small enough value, such as
> > 50ms.
> >
> > The idea behind the longer timeout is, it shows better performance when
> > there are highly likely subsequent requests.
> > For example, it takes about 100ms ~ 1s to create a new coldstart
> container.
> > If the action execution takes 10ms, it should wait 10 to 100 times more
> for
> > a new container.
> > In this case, it is reasonable to wait for the previous execution and
> reuse
> > the existing container rather than creating a new container.
> > So 100ms ~ 1s could be good options for the timeout value.
> > (Under heavy loads, I even observed it took 2s ~ 5s to create a coldstart
> > container.)
> > And this implies some changes in the notion of resources.
> >
> > In the cluster, there would be a different kind of requests.
> > There would be both batch and real-time invocation.
> > So I think this is a tradeoff.
> > Longer timeout will increase the reuse rate of containers but idle
> > containers will possess resources longer.
> >
> > And even in the current implementation, subsequent invocation should wait
> > for some time to remove existing(warmed containers) and create a new cold
> > start container.
> > As I said, it could take up to few seconds under heavy loads.
> > With reasonable timeout value, there would be no big performance
> difference
> > in the above situation.
> > (Actually, I expect new scheduler would outperform even with 5~10s
> timeout
> > value as it will evenly distribute docker operation.
> > In the current implementation, all execution is sent to the home invoker
> > first and it could make the situation worse in edge cases.
> > I hope I can share performance comparison results as I am doing
> > benchmarking.)
> >
> > Also, I think the above case is an edge case that someone is consuming
> most
> > of the cluster resources and executing two different batch invocation
> > alternatively.
> > Anyway, we can support such an edge case with some shutdown period.
> > This can be controversial, but I believe this is a viable option.
> >
> >
> > If you said that in the view of OpenWhisk operator, I think you meant
> there
> > are more than 1 heavy users.
> > Let's say, one user has 60 containers limit and the other has 80
> containers
> > limit.
> > Then can we guarantee both users' execution without any issue in current
> > implementation?
> > If their invocation requests come together, one or both of their
> invocation
> > will be heavily delayed.
> >
> > So I think when we(operators) notice there are such heavy users, we
> should
> > scale out our clusters to guarantee their invocation or we should reduce
> > their resource limit.
> > This is also a tradeoff. If we must guarantee their invocation, we at
> least
> > need a bigger cluster than the sum of their throttling limit.
> > If we have weak SLA, we can support both users with smaller cluster
> though
> > their invocation can be delayed a bit.
> >
> >
> > In short, if you prefer the current behavior you can still have a similar
> > effect by configuring the timeout as 50ms.
> > (So containers will only wait for 50ms, though it may induce some
> > performance degradation in other cases.)
> >
> > Best regards
> > Dominic
> >
> >
> > 2019년 4월 10일 (수) 오전 1:36, Dascalita Dragos <dd...@gmail.com>님이 작성:
> >
> > > "...When there is no more activation message, ContainerProxy will be
> wait
> > > for the given time(configurable) and just stop...."
> > >
> > > How does the system allocate and de-allocate resources when it's
> > congested
> > > ?
> > > I'm thinking at the use case where the system receives a batch of
> > > activations that require 60% of all cluster resources. Once those
> > > activations finish, a different batch of activations are received, and
> > this
> > > time the new batch requires new actions to be cold-started; these new
> > > activations require a total of 80% of the overall cluster resources.
> > Unless
> > > the previous actions are removed, the cluster is over-allocated. In the
> > > current model would the cluster process 1/2 of the new activations b/c
> it
> > > needs to wait for the previous actions to stop by themselves ?
> > >
> > > On Sun, Apr 7, 2019 at 7:34 PM Dominic Kim <st...@gmail.com>
> wrote:
> > >
> > > > Hi Mingyu
> > > >
> > > > Thank you for the good questions.
> > > >
> > > > Before answering to your question, I will share the Lease in ETCD
> > first.
> > > > ETCD has a data model which is disappear after given time if there is
> > no
> > > > relevant keepalive on it, the Lease.
> > > >
> > > > So once you grant a new lease, you can put it with data in each
> > operation
> > > > such as put, putTxn(transaction), etc.
> > > > If there is no keep-alive for the given(configurable) time, inserted
> > data
> > > > will be gone.
> > > >
> > > > In my proposal, most of data in ETCD rely on a lease.
> > > > For example, each scheduler stores their endpoint information(for
> queue
> > > > creation) with a lease. Each queue stores their information(for
> > > activation)
> > > > in ETCD with a lease.
> > > > (It is overhead to do keep-alive in each memory queue separately, I
> > > > introduced EtcdKeepAliveService to share one global lease among all
> > > queues
> > > > in a same scheduler.)
> > > > Each ContainerProxy store their information in ETCD with a lease so
> > that
> > > > when a queue tries to create a container, they can easily count the
> > > number
> > > > of existing containers with "Count" API.
> > > > Both data are stored with a lease, if one scheduler or invoker are
> > > failed,
> > > > keep-alive for the given lease is not continued, and finally those
> data
> > > > will be removed.
> > > >
> > > > Follower queues are watching on the leader queue information. If
> there
> > > are
> > > > any changes,(the data will be removed upon scheduler failure) they
> can
> > > > receive the notification and start new leader election.
> > > > When a scheduler is failed, ContainerProxys which were communicating
> > > with a
> > > > queue in that scheduler, will receive a connection error.
> > > > Then they are designed to access to the ETCD again to figure out the
> > > > endpoint of the leader queue.
> > > > As one of followers becomes a new leader, ContainerProxys can connect
> > to
> > > > the new leader.
> > > >
> > > > One thing to note here is, there is only one QueueManager in each
> > > > scheduler.
> > > > One QueueManager holds all queues and delegate requests to the proper
> > > queue
> > > > in respond to "fetch" requests.
> > > >
> > > > In short, all endpoints data are stored in ETCD and they are renewed
> > > based
> > > > on keep-alive and lease.
> > > > Each components are designed to access ETCD when the failure detected
> > and
> > > > connect to a new(failed-over) scheduler.
> > > >
> > > > I hope it is useful to you.
> > > > And I think when I and my colleagues open PRs, we need to add more
> > detail
> > > > design along with them.
> > > >
> > > > If you have any further questions, kindly let me know.
> > > >
> > > > Thanks
> > > > Best regards
> > > > Dominic
> > > >
> > > >
> > > >
> > > > 2019년 4월 6일 (토) 오전 11:28, Mingyu Zhou <zh...@gmail.com>님이 작성:
> > > >
> > > > > Dear Dominic,
> > > > >
> > > > > Thanks for your proposal. It is very inspirational and it looks
> > > > promising.
> > > > >
> > > > > I'd like to ask some questions about the fall over/failure recovery
> > > > > mechanism of the scheduler component.
> > > > >
> > > > > IIUC, a scheduler instance hosts multiple queue managers. If a
> > > scheduler
> > > > is
> > > > > down, we will lose multiple queue managers. Thus, there should be
> > some
> > > > form
> > > > > of failure recovery of queue managers and it raises the following
> > > > > questions:
> > > > >
> > > > > 1. In your proposal, how the failure of a scheduler is detected?
> > I.e.,
> > > > > when a scheduler instance is down and some queue manager become
> > > > > unreachable, which component will be aware of this unavailability
> and
> > > > then
> > > > > trigger the recovery procedure?
> > > > >
> > > > > 2. How should the failure be recovered and lost queue managers be
> > > brought
> > > > > back to life? Specifically, in your proposal, you designed a hot
> > > > > standing-by pairing of queue managers (one leader/two followers).
> > Then
> > > > how
> > > > > should the new leader be selected in face of scheduler crash? And
> do
> > we
> > > > > need to allocate a new queue manager to maintain the
> > > > > one-leader-two-follower configuration?
> > > > >
> > > > > 3. How will the other components in the system learn the new
> > > > configuration
> > > > > after a fall over? For example, how will the pool balancer discover
> > the
> > > > new
> > > > > state of the scheduler it managers and change its policy to
> > distribute
> > > > > queue creation requests?
> > > > >
> > > > > Thanks
> > > > > Mingyu Zhou
> > > > >
> > > > > On Fri, Apr 5, 2019 at 2:56 PM Dominic Kim <st...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Dear David, Matt, and Dascalita.
> > > > > > Thank you for your interest in my proposal.
> > > > > >
> > > > > > Let me answer your questions one by one.
> > > > > >
> > > > > > @David
> > > > > > Yes, I will(and actually already did) implement all components
> > based
> > > on
> > > > > > SPI.
> > > > > > The reason why I said "breaking changes" is, my proposal will
> > affect
> > > > most
> > > > > > of components drastically.
> > > > > > For example, InvokerReactive will become a SPI and current
> > > > > InvokerReactive
> > > > > > will become one of its concrete implementation.
> > > > > > My load balancer and throttler are also based on the current SPI.
> > > > > > So though my implementation would be included in OpenWhisk,
> > > downstreams
> > > > > > still can take advantage of existing implementation such as
> > > > > > ShardingPoolBalancer.
> > > > > >
> > > > > > Regarding Leader/Follower, a fair point.
> > > > > > The reason why I introduced such a model is to prepare for the
> > future
> > > > > > enhancement.
> > > > > > Actually, I reached a conclusion that memory based activation
> > passing
> > > > > would
> > > > > > be enough for OpenWhisk in terms of message persistence.
> > > > > > But it is just my own opinion and community may want more rigid
> > level
> > > > of
> > > > > > persistence.
> > > > > > I naively thought we can add replication and HA logic in the
> queue
> > > > which
> > > > > > are similar to the one in Kafka.
> > > > > > And Leader/Follower would be a good base building block for this.
> > > > > >
> > > > > > Currently only a leader fetch activation messages from Kafka.
> > > Followers
> > > > > > will be idle while watching the leadership change.
> > > > > > Once the leadership is changed, one of followers will become a
> new
> > > > leader
> > > > > > and at that time, Kafka consumer for the new leader will be
> > created.
> > > > > > This is to minimize the failure handling time in the aspect of
> > > clients
> > > > as
> > > > > > you mentioned. It is also correct that this flow does not prevent
> > > > > > activation messages lost on scheduler failure.
> > > > > > But it's not that complex as activation messages are not
> replicated
> > > to
> > > > > > followers and the number of followers are configurable.
> > > > > > If we want, we can configure the number of required queue to 1,
> > there
> > > > > will
> > > > > > be only one leader queue.
> > > > > > (If we ok with the current level of persistence, I think we may
> not
> > > > need
> > > > > > more than 1 queue as you said.)
> > > > > >
> > > > > > Regarding pulling activation messages, each action will have its
> > own
> > > > > Kafka
> > > > > > topic.
> > > > > > It is same with what I proposed in my previous proposals.
> > > > > > When an action is created, a Kafka topic for the action will be
> > > > created.
> > > > > > So each leader queue(consumer) will fetch activation messages
> from
> > > its
> > > > > own
> > > > > > Kafka topic and there would be no intervention among actions.
> > > > > >
> > > > > > When I and my colleagues open PRs for each component, we will add
> > > > detail
> > > > > > component design.
> > > > > > It would help you guys understand the proposal more.
> > > > > >
> > > > > > @Matt
> > > > > > Thank you for the suggestion.
> > > > > > If I change the name of it now, it would break the link in this
> > > thread.
> > > > > > I would use the name you suggested when I open a PR.
> > > > > >
> > > > > >
> > > > > > @Dascalita
> > > > > >
> > > > > > Interesting idea.
> > > > > > Any GC patterns do you keep in your mind to apply in OpenWhisk?
> > > > > >
> > > > > > In my proposal, the container GC is similar to what OpenWhisk
> does
> > > > these
> > > > > > days.
> > > > > > Each container will autonomously fetch activations from the
> queue.
> > > > > > Whenever they finish invocation of one activation, they will
> fetch
> > > the
> > > > > next
> > > > > > request and invoke it.
> > > > > > In this sense, we can maximize the container reuse.
> > > > > >
> > > > > > When there is no more activation message, ContainerProxy will be
> > wait
> > > > for
> > > > > > the given time(configurable) and just stop.
> > > > > > One difference is containers are no more paused, they are just
> > > removed.
> > > > > > Instead of pausing them, containers are waiting for subsequent
> > > requests
> > > > > for
> > > > > > longer time(5~10s) than current implementation.
> > > > > > This is because pausing is also relatively expensive operation in
> > > > > > comparison to short-running invocation.
> > > > > >
> > > > > > Container lifecycle is managed in this way.
> > > > > > 1. When a container is created, it will add its information in
> > ETCD.
> > > > > > 2. A queue will count the existing number of containers using
> above
> > > > > > information.
> > > > > > 3. Under heavy loads, the queue will request more containers if
> the
> > > > > number
> > > > > > of existing containers is less than its resource limit.
> > > > > > 4. When the container is removed, it will delete its information
> in
> > > > ETCD.
> > > > > >
> > > > > >
> > > > > > Again, I really appreciate all your feedbacks and questions.
> > > > > > If you have any further questions, kindly let me know.
> > > > > >
> > > > > > Best regards
> > > > > > Dominic
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2019년 4월 5일 (금) 오전 1:24, Dascalita Dragos <dd...@gmail.com>님이
> > 작성:
> > > > > >
> > > > > > > Hi Dominic,
> > > > > > > Thanks for sharing your ideas. IIUC (and pls keep me honest),
> the
> > > > goal
> > > > > of
> > > > > > > the new design is to improve activation performance. I
> personally
> > > > love
> > > > > > > this; performance is a critical non-functional feature of any
> > FaaS
> > > > > > system.
> > > > > > >
> > > > > > > There’s something I’d like to call out: the management of
> > > containers
> > > > > in a
> > > > > > > FaaS system could be compared to a JVM. A JVM allocates objects
> > in
> > > > > > memory,
> > > > > > > and GC them. A FaaS system allocates containers to run actions,
> > and
> > > > it
> > > > > > GCs
> > > > > > > them when they become idle. If we could look at OW's scheduling
> > > from
> > > > > this
> > > > > > > perspective, we could reuse the proven patterns in the JVM vs
> > > > inventing
> > > > > > > something new. I’d be interested on any GC implications in the
> > new
> > > > > > design,
> > > > > > > meaning how idle actions get removed, and how is that
> > orchestrated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > dragos
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Apr 4, 2019 at 8:40 AM Matt Sicker <bo...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Would it make sense to define an OpenWhisk
> > > Improvement/Enhancement
> > > > > > > > Propoposal or similar that various other Apache projects do?
> We
> > > > could
> > > > > > > > call them WHIPs or something. :)
> > > > > > > >
> > > > > > > > On Thu, 4 Apr 2019 at 09:09, David P Grove <
> groved@us.ibm.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Dominic Kim <st...@gmail.com> wrote on 04/04/2019
> > 04:37:19
> > > > AM:
> > > > > > > > > >
> > > > > > > > > > I have proposed a new architecture.
> > > > > > > > > >
> > > > > > >
> > > >
> https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture
> > > > > > > > > +proposal
> > > > > > > > > >
> > > > > > > > > > It includes many controversial agendas and breaking
> > changes.
> > > > > > > > > > So I would like to form a general consensus on them.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Dominic,
> > > > > > > > >
> > > > > > > > >         There's much to like about the proposal.  Thank you
> > for
> > > > > > writing
> > > > > > > > it
> > > > > > > > > up.
> > > > > > > > >
> > > > > > > > >         One meta-comment is that the work will have to be
> > done
> > > > in a
> > > > > > way
> > > > > > > > so
> > > > > > > > > there are no actual "breaking changes".  It has to be
> > possible
> > > to
> > > > > > > > continue
> > > > > > > > > to configure the system using the existing architectures
> > while
> > > > this
> > > > > > > work
> > > > > > > > > proceeds.  I would expect this could be done via a new
> > > > LoadBalancer
> > > > > > and
> > > > > > > > > some deployment options (similar to how Lean OpenWhisk was
> > > done).
> > > > > If
> > > > > > > > work
> > > > > > > > > needs to be done to generalize the LoadBalancer SPI, that
> > could
> > > > be
> > > > > > done
> > > > > > > > > early in the process.
> > > > > > > > >
> > > > > > > > >         On the proposal itself, I wonder if the complexity
> of
> > > > > > > > Leader/Follower
> > > > > > > > > is actually needed?  If a Scheduler crashes, it could be
> > > > restarted
> > > > > > and
> > > > > > > > then
> > > > > > > > > resume handling its assigned load.  I think there should be
> > > > enough
> > > > > > > > > information in etcd for it to recover its current set of
> > > assigned
> > > > > > > > > ContainerProxys and carry on.   Activations in its in
> memory
> > > > queues
> > > > > > > would
> > > > > > > > > be lost (bigger blast radius than the current
> architecture),
> > > but
> > > > I
> > > > > > > don't
> > > > > > > > > see that the Leader/Follower changes that (seems way too
> > > > expensive
> > > > > to
> > > > > > > be
> > > > > > > > > replicating every activation in the Follower Queues).   The
> > > > > > > > Leader/Follower
> > > > > > > > > would allow for shorter downtime for those actions assigned
> > to
> > > > the
> > > > > > > downed
> > > > > > > > > Scheduler, but at the cost of significant complexity.  Is
> it
> > > > worth
> > > > > > it?
> > > > > > > > >
> > > > > > > > >         Perhaps related to the Leader/Follower, its not
> clear
> > > to
> > > > me
> > > > > > how
> > > > > > > > > activation messages are being pulled from the action topic
> in
> > > > Kafka
> > > > > > > > during
> > > > > > > > > the Queue creation window. I think they have to go
> somewhere
> > > > > (because
> > > > > > > the
> > > > > > > > > is a mix of actions on a single Kafka topic and we can't
> > stall
> > > > > other
> > > > > > > > > actions while waiting for a Queue to be created for a new
> > > > action),
> > > > > > but
> > > > > > > if
> > > > > > > > > you don't know yet which Scheduler is going to win the race
> > to
> > > > be a
> > > > > > > > Leader
> > > > > > > > > how do you know where to put them?
> > > > > > > > >
> > > > > > > > > --dave
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Matt Sicker <bo...@gmail.com>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > 周明宇
> > > > >
> > > >
> > >
> >
>

Re: New architecture proposal

Posted by Dascalita Dragos <dd...@gmail.com>.
Thanks Dominic for the details.

It seems like an operator has to choose between “do I hurt performance (low
timeout)” and “do I hurt the SLA”?

If this is the trade-off, isn’t it a hard choice to make? So I’m wondering
whether some alternative designs could be used for this problem.

The key decision here is whether OW should be given cluster-wide power to
view and control the resources. IIUC the current proposal doesn’t support
this? I’m not saying the proposed model is not good; I’d just feel more
comfortable if OW allowed more options instead of one, in the same way the
JVM allows multiple GC implementations. In the proposed model the GC
offloads the decision to each container, while other implementations may do
it differently. For instance, I’d implement something dynamic that adapts
the timeout to the load, and maybe try some predictive ML algorithms to
manage resources: if a model suggests that, out of 3 actions that could be
removed, 1 has a higher probability of being invoked again, wouldn’t it be
more efficient to remove one of the other 2? Such logic can only be achieved
by an entity with a cluster-wide view, since actions don’t know about each
other and cannot negotiate a dynamic timeout on their own.
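
To illustrate, here is a rough sketch (hypothetical names like IdleContainer
and recentInvocations; not a worked-out design) of how a cluster-wide GC
could rank idle containers by predicted re-use and adapt the timeout to the
load:

// Hypothetical view: a cluster-wide GC sees every idle container at once.
final case class IdleContainer(action: String, idleMs: Long)

// Evict the containers least likely to be invoked again first: fewer recent
// invocations => lower re-use probability; ties broken by idle time (oldest first).
def evictionOrder(idle: Seq[IdleContainer],
                  recentInvocations: Map[String, Int]): Seq[IdleContainer] =
  idle.sortBy(c => (recentInvocations.getOrElse(c.action, 0), -c.idleMs))

// A timeout that shrinks as the cluster fills up, instead of a fixed value.
def dynamicTimeoutMs(baseMs: Long, clusterUtilization: Double): Long =
  math.max(50L, (baseMs * (1.0 - clusterUtilization)).toLong)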

- dragos

On Wed, Apr 10, 2019 at 3:46 AM Dominic Kim <st...@gmail.com> wrote:

> Dear Dascalita
>
> That depends on the timeout configuration.
> For example, if you need something similar to the one in the current code
> base, you can just configure the timeout to a small enough value, such as
> 50ms.
>
> The idea behind the longer timeout is, it shows better performance when
> there are highly likely subsequent requests.
> For example, it takes about 100ms ~ 1s to create a new coldstart container.
> If the action execution takes 10ms, it should wait 10 to 100 times more for
> a new container.
> In this case, it is reasonable to wait for the previous execution and reuse
> the existing container rather than creating a new container.
> So 100ms ~ 1s could be good options for the timeout value.
> (Under heavy loads, I even observed it took 2s ~ 5s to create a coldstart
> container.)
> And this implies some changes in the notion of resources.
>
> In the cluster, there would be a different kind of requests.
> There would be both batch and real-time invocation.
> So I think this is a tradeoff.
> Longer timeout will increase the reuse rate of containers but idle
> containers will possess resources longer.
>
> And even in the current implementation, subsequent invocation should wait
> for some time to remove existing(warmed containers) and create a new cold
> start container.
> As I said, it could take up to few seconds under heavy loads.
> With reasonable timeout value, there would be no big performance difference
> in the above situation.
> (Actually, I expect new scheduler would outperform even with 5~10s timeout
> value as it will evenly distribute docker operation.
> In the current implementation, all execution is sent to the home invoker
> first and it could make the situation worse in edge cases.
> I hope I can share performance comparison results as I am doing
> benchmarking.)
>
> Also, I think the above case is an edge case that someone is consuming most
> of the cluster resources and executing two different batch invocation
> alternatively.
> Anyway, we can support such an edge case with some shutdown period.
> This can be controversial, but I believe this is a viable option.
>
>
> If you said that in the view of OpenWhisk operator, I think you meant there
> are more than 1 heavy users.
> Let's say, one user has 60 containers limit and the other has 80 containers
> limit.
> Then can we guarantee both users' execution without any issue in current
> implementation?
> If their invocation requests come together, one or both of their invocation
> will be heavily delayed.
>
> So I think when we(operators) notice there are such heavy users, we should
> scale out our clusters to guarantee their invocation or we should reduce
> their resource limit.
> This is also a tradeoff. If we must guarantee their invocation, we at least
> need a bigger cluster than the sum of their throttling limit.
> If we have weak SLA, we can support both users with smaller cluster though
> their invocation can be delayed a bit.
>
>
> In short, if you prefer the current behavior you can still have a similar
> effect by configuring the timeout as 50ms.
> (So containers will only wait for 50ms, though it may induce some
> performance degradation in other cases.)
>
> Best regards
> Dominic
>
>
> 2019년 4월 10일 (수) 오전 1:36, Dascalita Dragos <dd...@gmail.com>님이 작성:
>
> > "...When there is no more activation message, ContainerProxy will be wait
> > for the given time(configurable) and just stop...."
> >
> > How does the system allocate and de-allocate resources when it's
> congested
> > ?
> > I'm thinking at the use case where the system receives a batch of
> > activations that require 60% of all cluster resources. Once those
> > activations finish, a different batch of activations are received, and
> this
> > time the new batch requires new actions to be cold-started; these new
> > activations require a total of 80% of the overall cluster resources.
> Unless
> > the previous actions are removed, the cluster is over-allocated. In the
> > current model would the cluster process 1/2 of the new activations b/c it
> > needs to wait for the previous actions to stop by themselves ?
> >
> > On Sun, Apr 7, 2019 at 7:34 PM Dominic Kim <st...@gmail.com> wrote:
> >
> > > Hi Mingyu
> > >
> > > Thank you for the good questions.
> > >
> > > Before answering to your question, I will share the Lease in ETCD
> first.
> > > ETCD has a data model which is disappear after given time if there is
> no
> > > relevant keepalive on it, the Lease.
> > >
> > > So once you grant a new lease, you can put it with data in each
> operation
> > > such as put, putTxn(transaction), etc.
> > > If there is no keep-alive for the given(configurable) time, inserted
> data
> > > will be gone.
> > >
> > > In my proposal, most of data in ETCD rely on a lease.
> > > For example, each scheduler stores their endpoint information(for queue
> > > creation) with a lease. Each queue stores their information(for
> > activation)
> > > in ETCD with a lease.
> > > (It is overhead to do keep-alive in each memory queue separately, I
> > > introduced EtcdKeepAliveService to share one global lease among all
> > queues
> > > in a same scheduler.)
> > > Each ContainerProxy store their information in ETCD with a lease so
> that
> > > when a queue tries to create a container, they can easily count the
> > number
> > > of existing containers with "Count" API.
> > > Both data are stored with a lease, if one scheduler or invoker are
> > failed,
> > > keep-alive for the given lease is not continued, and finally those data
> > > will be removed.
> > >
> > > Follower queues are watching on the leader queue information. If there
> > are
> > > any changes,(the data will be removed upon scheduler failure) they can
> > > receive the notification and start new leader election.
> > > When a scheduler is failed, ContainerProxys which were communicating
> > with a
> > > queue in that scheduler, will receive a connection error.
> > > Then they are designed to access to the ETCD again to figure out the
> > > endpoint of the leader queue.
> > > As one of followers becomes a new leader, ContainerProxys can connect
> to
> > > the new leader.
> > >
> > > One thing to note here is, there is only one QueueManager in each
> > > scheduler.
> > > One QueueManager holds all queues and delegate requests to the proper
> > queue
> > > in respond to "fetch" requests.
> > >
> > > In short, all endpoints data are stored in ETCD and they are renewed
> > based
> > > on keep-alive and lease.
> > > Each components are designed to access ETCD when the failure detected
> and
> > > connect to a new(failed-over) scheduler.
> > >
> > > I hope it is useful to you.
> > > And I think when I and my colleagues open PRs, we need to add more
> detail
> > > design along with them.
> > >
> > > If you have any further questions, kindly let me know.
> > >
> > > Thanks
> > > Best regards
> > > Dominic
> > >
> > >
> > >
> > > 2019년 4월 6일 (토) 오전 11:28, Mingyu Zhou <zh...@gmail.com>님이 작성:
> > >
> > > > Dear Dominic,
> > > >
> > > > Thanks for your proposal. It is very inspirational and it looks
> > > promising.
> > > >
> > > > I'd like to ask some questions about the fall over/failure recovery
> > > > mechanism of the scheduler component.
> > > >
> > > > IIUC, a scheduler instance hosts multiple queue managers. If a
> > scheduler
> > > is
> > > > down, we will lose multiple queue managers. Thus, there should be
> some
> > > form
> > > > of failure recovery of queue managers and it raises the following
> > > > questions:
> > > >
> > > > 1. In your proposal, how the failure of a scheduler is detected?
> I.e.,
> > > > when a scheduler instance is down and some queue manager become
> > > > unreachable, which component will be aware of this unavailability and
> > > then
> > > > trigger the recovery procedure?
> > > >
> > > > 2. How should the failure be recovered and lost queue managers be
> > brought
> > > > back to life? Specifically, in your proposal, you designed a hot
> > > > standing-by pairing of queue managers (one leader/two followers).
> Then
> > > how
> > > > should the new leader be selected in face of scheduler crash? And do
> we
> > > > need to allocate a new queue manager to maintain the
> > > > one-leader-two-follower configuration?
> > > >
> > > > 3. How will the other components in the system learn the new
> > > configuration
> > > > after a fall over? For example, how will the pool balancer discover
> the
> > > new
> > > > state of the scheduler it managers and change its policy to
> distribute
> > > > queue creation requests?
> > > >
> > > > Thanks
> > > > Mingyu Zhou
> > > >
> > > > On Fri, Apr 5, 2019 at 2:56 PM Dominic Kim <st...@gmail.com>
> > wrote:
> > > >
> > > > > Dear David, Matt, and Dascalita.
> > > > > Thank you for your interest in my proposal.
> > > > >
> > > > > Let me answer your questions one by one.
> > > > >
> > > > > @David
> > > > > Yes, I will(and actually already did) implement all components
> based
> > on
> > > > > SPI.
> > > > > The reason why I said "breaking changes" is, my proposal will
> affect
> > > most
> > > > > of components drastically.
> > > > > For example, InvokerReactive will become a SPI and current
> > > > InvokerReactive
> > > > > will become one of its concrete implementation.
> > > > > My load balancer and throttler are also based on the current SPI.
> > > > > So though my implementation would be included in OpenWhisk,
> > downstreams
> > > > > still can take advantage of existing implementation such as
> > > > > ShardingPoolBalancer.
> > > > >
> > > > > Regarding Leader/Follower, a fair point.
> > > > > The reason why I introduced such a model is to prepare for the
> future
> > > > > enhancement.
> > > > > Actually, I reached a conclusion that memory based activation
> passing
> > > > would
> > > > > be enough for OpenWhisk in terms of message persistence.
> > > > > But it is just my own opinion and community may want more rigid
> level
> > > of
> > > > > persistence.
> > > > > I naively thought we can add replication and HA logic in the queue
> > > which
> > > > > are similar to the one in Kafka.
> > > > > And Leader/Follower would be a good base building block for this.
> > > > >
> > > > > Currently only a leader fetch activation messages from Kafka.
> > Followers
> > > > > will be idle while watching the leadership change.
> > > > > Once the leadership is changed, one of followers will become a new
> > > leader
> > > > > and at that time, Kafka consumer for the new leader will be
> created.
> > > > > This is to minimize the failure handling time in the aspect of
> > clients
> > > as
> > > > > you mentioned. It is also correct that this flow does not prevent
> > > > > activation messages lost on scheduler failure.
> > > > > But it's not that complex as activation messages are not replicated
> > to
> > > > > followers and the number of followers are configurable.
> > > > > If we want, we can configure the number of required queue to 1,
> there
> > > > will
> > > > > be only one leader queue.
> > > > > (If we ok with the current level of persistence, I think we may not
> > > need
> > > > > more than 1 queue as you said.)
> > > > >
> > > > > Regarding pulling activation messages, each action will have its
> own
> > > > Kafka
> > > > > topic.
> > > > > It is same with what I proposed in my previous proposals.
> > > > > When an action is created, a Kafka topic for the action will be
> > > created.
> > > > > So each leader queue(consumer) will fetch activation messages from
> > its
> > > > own
> > > > > Kafka topic and there would be no intervention among actions.
> > > > >
> > > > > When I and my colleagues open PRs for each component, we will add
> > > detail
> > > > > component design.
> > > > > It would help you guys understand the proposal more.
> > > > >
> > > > > @Matt
> > > > > Thank you for the suggestion.
> > > > > If I change the name of it now, it would break the link in this
> > thread.
> > > > > I would use the name you suggested when I open a PR.
> > > > >
> > > > >
> > > > > @Dascalita
> > > > >
> > > > > Interesting idea.
> > > > > Any GC patterns do you keep in your mind to apply in OpenWhisk?
> > > > >
> > > > > In my proposal, the container GC is similar to what OpenWhisk does
> > > these
> > > > > days.
> > > > > Each container will autonomously fetch activations from the queue.
> > > > > Whenever they finish invocation of one activation, they will fetch
> > the
> > > > next
> > > > > request and invoke it.
> > > > > In this sense, we can maximize the container reuse.
> > > > >
> > > > > When there is no more activation message, ContainerProxy will be
> wait
> > > for
> > > > > the given time(configurable) and just stop.
> > > > > One difference is containers are no more paused, they are just
> > removed.
> > > > > Instead of pausing them, containers are waiting for subsequent
> > requests
> > > > for
> > > > > longer time(5~10s) than current implementation.
> > > > > This is because pausing is also relatively expensive operation in
> > > > > comparison to short-running invocation.
> > > > >
> > > > > Container lifecycle is managed in this way.
> > > > > 1. When a container is created, it will add its information in
> ETCD.
> > > > > 2. A queue will count the existing number of containers using above
> > > > > information.
> > > > > 3. Under heavy loads, the queue will request more containers if the
> > > > number
> > > > > of existing containers is less than its resource limit.
> > > > > 4. When the container is removed, it will delete its information in
> > > ETCD.
> > > > >
> > > > >
> > > > > Again, I really appreciate all your feedbacks and questions.
> > > > > If you have any further questions, kindly let me know.
> > > > >
> > > > > Best regards
> > > > > Dominic
> > > > >
> > > > >
> > > > >
> > > > > 2019년 4월 5일 (금) 오전 1:24, Dascalita Dragos <dd...@gmail.com>님이
> 작성:
> > > > >
> > > > > > Hi Dominic,
> > > > > > Thanks for sharing your ideas. IIUC (and pls keep me honest), the
> > > goal
> > > > of
> > > > > > the new design is to improve activation performance. I personally
> > > love
> > > > > > this; performance is a critical non-functional feature of any
> FaaS
> > > > > system.
> > > > > >
> > > > > > There’s something I’d like to call out: the management of
> > containers
> > > > in a
> > > > > > FaaS system could be compared to a JVM. A JVM allocates objects
> in
> > > > > memory,
> > > > > > and GC them. A FaaS system allocates containers to run actions,
> and
> > > it
> > > > > GCs
> > > > > > them when they become idle. If we could look at OW's scheduling
> > from
> > > > this
> > > > > > perspective, we could reuse the proven patterns in the JVM vs
> > > inventing
> > > > > > something new. I’d be interested on any GC implications in the
> new
> > > > > design,
> > > > > > meaning how idle actions get removed, and how is that
> orchestrated.
> > > > > >
> > > > > > Thanks,
> > > > > > dragos
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 4, 2019 at 8:40 AM Matt Sicker <bo...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Would it make sense to define an OpenWhisk
> > Improvement/Enhancement
> > > > > > > Propoposal or similar that various other Apache projects do? We
> > > could
> > > > > > > call them WHIPs or something. :)
> > > > > > >
> > > > > > > On Thu, 4 Apr 2019 at 09:09, David P Grove <gr...@us.ibm.com>
> > > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Dominic Kim <st...@gmail.com> wrote on 04/04/2019
> 04:37:19
> > > AM:
> > > > > > > > >
> > > > > > > > > I have proposed a new architecture.
> > > > > > > > >
> > > > > >
> > > https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture
> > > > > > > > +proposal
> > > > > > > > >
> > > > > > > > > It includes many controversial agendas and breaking
> changes.
> > > > > > > > > So I would like to form a general consensus on them.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hi Dominic,
> > > > > > > >
> > > > > > > >         There's much to like about the proposal.  Thank you
> for
> > > > > writing
> > > > > > > it
> > > > > > > > up.
> > > > > > > >
> > > > > > > >         One meta-comment is that the work will have to be
> done
> > > in a
> > > > > way
> > > > > > > so
> > > > > > > > there are no actual "breaking changes".  It has to be
> possible
> > to
> > > > > > > continue
> > > > > > > > to configure the system using the existing architectures
> while
> > > this
> > > > > > work
> > > > > > > > proceeds.  I would expect this could be done via a new
> > > LoadBalancer
> > > > > and
> > > > > > > > some deployment options (similar to how Lean OpenWhisk was
> > done).
> > > > If
> > > > > > > work
> > > > > > > > needs to be done to generalize the LoadBalancer SPI, that
> could
> > > be
> > > > > done
> > > > > > > > early in the process.
> > > > > > > >
> > > > > > > >         On the proposal itself, I wonder if the complexity of
> > > > > > > Leader/Follower
> > > > > > > > is actually needed?  If a Scheduler crashes, it could be
> > > restarted
> > > > > and
> > > > > > > then
> > > > > > > > resume handling its assigned load.  I think there should be
> > > enough
> > > > > > > > information in etcd for it to recover its current set of
> > assigned
> > > > > > > > ContainerProxys and carry on.   Activations in its in memory
> > > queues
> > > > > > would
> > > > > > > > be lost (bigger blast radius than the current architecture),
> > but
> > > I
> > > > > > don't
> > > > > > > > see that the Leader/Follower changes that (seems way too
> > > expensive
> > > > to
> > > > > > be
> > > > > > > > replicating every activation in the Follower Queues).   The
> > > > > > > Leader/Follower
> > > > > > > > would allow for shorter downtime for those actions assigned
> to
> > > the
> > > > > > downed
> > > > > > > > Scheduler, but at the cost of significant complexity.  Is it
> > > worth
> > > > > it?
> > > > > > > >
> > > > > > > >         Perhaps related to the Leader/Follower, its not clear
> > to
> > > me
> > > > > how
> > > > > > > > activation messages are being pulled from the action topic in
> > > Kafka
> > > > > > > during
> > > > > > > > the Queue creation window. I think they have to go somewhere
> > > > (because
> > > > > > the
> > > > > > > > is a mix of actions on a single Kafka topic and we can't
> stall
> > > > other
> > > > > > > > actions while waiting for a Queue to be created for a new
> > > action),
> > > > > but
> > > > > > if
> > > > > > > > you don't know yet which Scheduler is going to win the race
> to
> > > be a
> > > > > > > Leader
> > > > > > > > how do you know where to put them?
> > > > > > > >
> > > > > > > > --dave
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Matt Sicker <bo...@gmail.com>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > 周明宇
> > > >
> > >
> >
>

Re: New architecture proposal

Posted by Dominic Kim <st...@gmail.com>.
Dear Dascalita

That depends on the timeout configuration.
For example, if you need something similar to the one in the current code
base, you can just configure the timeout to a small enough value, such as
50ms.

The idea behind the longer timeout is that it yields better performance
when subsequent requests are highly likely.
For example, it takes about 100ms ~ 1s to create a new cold-start container.
If the action execution takes 10ms, a request would wait 10 to 100 times
longer for a new container than for the execution itself.
In this case, it is reasonable to wait for the previous execution to finish
and reuse the existing container rather than creating a new one.
So 100ms ~ 1s could be a good range for the timeout value.
(Under heavy loads, I have even observed it taking 2s ~ 5s to create a
cold-start container.)
And this implies some changes in the notion of resources.
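
As a back-of-the-envelope sketch of that break-even point (using the rough
numbers above, not benchmark results; nextLatencyMs is just an illustration):

// Latency seen by the next activation, depending on whether it arrives
// before or after the warm container's idle timeout expires.
def nextLatencyMs(gapMs: Double, idleTimeoutMs: Double,
                  coldStartMs: Double, execMs: Double): Double =
  if (gapMs <= idleTimeoutMs) execMs   // warm container reused
  else coldStartMs + execMs            // container gone: pay a cold start

// With execMs = 10 and coldStartMs = 100 ~ 1000, a request that misses the
// warm container pays a 10x ~ 100x penalty, so keeping the container alive
// for roughly one cold-start duration (100ms ~ 1s) is a sensible timeout.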

In the cluster, there would be different kinds of requests: both batch and
real-time invocations.
So I think this is a tradeoff.
A longer timeout will increase the container reuse rate, but idle containers
will hold resources longer.

And even in the current implementation, a subsequent invocation has to wait
for some time while existing (warmed) containers are removed and a new
cold-start container is created.
As I said, this could take up to a few seconds under heavy loads.
With a reasonable timeout value, there would be no big performance
difference in the above situation.
(Actually, I expect the new scheduler to outperform even with a 5~10s
timeout value, as it will evenly distribute Docker operations.
In the current implementation, all executions are sent to the home invoker
first, which can make the situation worse in edge cases.
I hope I can share performance comparison results as I am doing
benchmarking.)

Also, I think the above case is an edge case in which someone is consuming
most of the cluster resources and executing two different batch invocations
alternately.
Anyway, we can support such an edge case with some shutdown period.
This can be controversial, but I believe it is a viable option.


If you said that from the viewpoint of an OpenWhisk operator, I think you
meant there is more than one heavy user.
Let's say one user has a limit of 60 containers and the other has a limit of
80 containers.
Can we then guarantee both users' executions without any issue in the
current implementation?
If their invocation requests arrive together, one or both of their
invocations will be heavily delayed.

So I think when we (operators) notice there are such heavy users, we should
either scale out our clusters to guarantee their invocations or reduce their
resource limits.
This is also a tradeoff. If we must guarantee their invocations, we need a
cluster at least as big as the sum of their throttling limits.
If we have a weak SLA, we can support both users with a smaller cluster,
though their invocations may be delayed a bit.


In short, if you prefer the current behavior, you can still get a similar
effect by configuring the timeout to 50ms.
(Containers will then wait only 50ms, though this may cause some performance
degradation in other cases.)

Best regards
Dominic


On Wed, Apr 10, 2019 at 1:36 AM, Dascalita Dragos <dd...@gmail.com> wrote:

> "...When there is no more activation message, ContainerProxy will be wait
> for the given time(configurable) and just stop...."
>
> How does the system allocate and de-allocate resources when it's congested
> ?
> I'm thinking at the use case where the system receives a batch of
> activations that require 60% of all cluster resources. Once those
> activations finish, a different batch of activations are received, and this
> time the new batch requires new actions to be cold-started; these new
> activations require a total of 80% of the overall cluster resources. Unless
> the previous actions are removed, the cluster is over-allocated. In the
> current model would the cluster process 1/2 of the new activations b/c it
> needs to wait for the previous actions to stop by themselves ?
>
> On Sun, Apr 7, 2019 at 7:34 PM Dominic Kim <st...@gmail.com> wrote:
>
> > Hi Mingyu
> >
> > Thank you for the good questions.
> >
> > Before answering to your question, I will share the Lease in ETCD first.
> > ETCD has a data model which is disappear after given time if there is no
> > relevant keepalive on it, the Lease.
> >
> > So once you grant a new lease, you can put it with data in each operation
> > such as put, putTxn(transaction), etc.
> > If there is no keep-alive for the given(configurable) time, inserted data
> > will be gone.
> >
> > In my proposal, most of data in ETCD rely on a lease.
> > For example, each scheduler stores their endpoint information(for queue
> > creation) with a lease. Each queue stores their information(for
> activation)
> > in ETCD with a lease.
> > (It is overhead to do keep-alive in each memory queue separately, I
> > introduced EtcdKeepAliveService to share one global lease among all
> queues
> > in a same scheduler.)
> > Each ContainerProxy store their information in ETCD with a lease so that
> > when a queue tries to create a container, they can easily count the
> number
> > of existing containers with "Count" API.
> > Both data are stored with a lease, if one scheduler or invoker are
> failed,
> > keep-alive for the given lease is not continued, and finally those data
> > will be removed.
> >
> > Follower queues are watching on the leader queue information. If there
> are
> > any changes,(the data will be removed upon scheduler failure) they can
> > receive the notification and start new leader election.
> > When a scheduler is failed, ContainerProxys which were communicating
> with a
> > queue in that scheduler, will receive a connection error.
> > Then they are designed to access to the ETCD again to figure out the
> > endpoint of the leader queue.
> > As one of followers becomes a new leader, ContainerProxys can connect to
> > the new leader.
> >
> > One thing to note here is, there is only one QueueManager in each
> > scheduler.
> > One QueueManager holds all queues and delegate requests to the proper
> queue
> > in respond to "fetch" requests.
> >
> > In short, all endpoints data are stored in ETCD and they are renewed
> based
> > on keep-alive and lease.
> > Each components are designed to access ETCD when the failure detected and
> > connect to a new(failed-over) scheduler.
> >
> > I hope it is useful to you.
> > And I think when I and my colleagues open PRs, we need to add more detail
> > design along with them.
> >
> > If you have any further questions, kindly let me know.
> >
> > Thanks
> > Best regards
> > Dominic
> >
> >
> >
> > 2019년 4월 6일 (토) 오전 11:28, Mingyu Zhou <zh...@gmail.com>님이 작성:
> >
> > > Dear Dominic,
> > >
> > > Thanks for your proposal. It is very inspirational and it looks
> > promising.
> > >
> > > I'd like to ask some questions about the fall over/failure recovery
> > > mechanism of the scheduler component.

Re: New architecture proposal

Posted by Dascalita Dragos <dd...@gmail.com>.
"...When there is no more activation message, ContainerProxy will be wait
for the given time(configurable) and just stop...."

How does the system allocate and de-allocate resources when it's
congested?
I'm thinking of the use case where the system receives a batch of
activations that requires 60% of all cluster resources. Once those
activations finish, a different batch of activations is received, and this
time the new batch requires new actions to be cold-started; these new
activations require a total of 80% of the overall cluster resources. Unless
the previous actions are removed, the cluster is over-allocated. In the
current model, would the cluster process only 1/2 of the new activations
b/c it needs to wait for the previous actions to stop by themselves?

>

Re: New architecture proposal

Posted by Dominic Kim <st...@gmail.com>.
Hi Mingyu,

Thank you for the good questions.

Before answering your questions, let me first explain the Lease in ETCD.
ETCD has a data model, the lease: data attached to a lease disappears
after a given time if there is no relevant keep-alive for it.

So once you grant a new lease, you can attach it to data in each operation
such as put, putTxn (transaction), etc.
If there is no keep-alive within the given (configurable) time, the
inserted data is gone.
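
To make this concrete, a lease-backed registration could look roughly like
the sketch below. This is only an illustration from my side: the endpoint,
key name, and TTL are made up, and the jetcd calls are just meant to show
the idea, not the actual implementation.

import java.nio.charset.StandardCharsets.UTF_8
import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.lease.LeaseKeepAliveResponse
import io.etcd.jetcd.options.PutOption
import io.grpc.stub.StreamObserver

object LeaseSketch extends App {
  val client = Client.builder().endpoints("http://localhost:2379").build()

  // grant a lease with a 10-second TTL (hypothetical value)
  val leaseId = client.getLeaseClient.grant(10L).get().getID

  // register this scheduler's endpoint under the lease; if keep-alives
  // stop, etcd deletes the key by itself once the TTL expires
  val key   = ByteSequence.from("scheduler/scheduler0/endpoint", UTF_8)
  val value = ByteSequence.from("10.0.0.1:8080", UTF_8)
  client.getKVClient
    .put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build())
    .get()

  // send keep-alives for as long as this process stays healthy
  client.getLeaseClient.keepAlive(leaseId,
    new StreamObserver[LeaseKeepAliveResponse] {
      override def onNext(r: LeaseKeepAliveResponse): Unit = ()
      override def onError(t: Throwable): Unit = ()
      override def onCompleted(): Unit = ()
    })
}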

In my proposal, most of the data in ETCD relies on a lease.
For example, each scheduler stores its endpoint information (for queue
creation) with a lease, and each queue stores its information (for
activation) in ETCD with a lease.
(Since it would be overhead to do keep-alive for each memory queue
separately, I introduced EtcdKeepAliveService to share one global lease
among all queues in the same scheduler.)
Each ContainerProxy also stores its information in ETCD with a lease, so
that when a queue tries to create a container, it can easily count the
number of existing containers with the "Count" API.
Because both kinds of data are stored with a lease, if a scheduler or an
invoker fails, the keep-alive for the given lease stops and the data is
eventually removed.
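
For example, counting the live containers of an action could be a single
ranged read with countOnly, roughly as in this sketch (the key layout is
again made up for illustration):

import java.nio.charset.StandardCharsets.UTF_8
import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.options.GetOption

object ContainerCountSketch {
  // count the registered containers of one action with a prefix scan;
  // countOnly avoids transferring the key/value payloads themselves
  def countContainers(client: Client, action: String): Long = {
    val prefix = ByteSequence.from(s"container/$action/", UTF_8)
    val opt =
      GetOption.newBuilder().withPrefix(prefix).withCountOnly(true).build()
    client.getKVClient.get(prefix, opt).get().getCount
  }
}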

Follower queues watch the leader queue information. If there is any
change (the data is removed upon scheduler failure), they receive a
notification and start a new leader election.
When a scheduler fails, the ContainerProxys which were communicating with a
queue in that scheduler will receive a connection error.
They are then designed to access ETCD again to figure out the endpoint of
the leader queue.
Once one of the followers becomes the new leader, the ContainerProxys can
connect to it.
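
The follower-side watch could look roughly like the sketch below (the
leader key name is hypothetical); the DELETE event fires once the leader's
lease expires:

import java.nio.charset.StandardCharsets.UTF_8
import io.etcd.jetcd.{ByteSequence, Client, Watch}
import io.etcd.jetcd.watch.{WatchEvent, WatchResponse}

object LeaderWatchSketch {
  // follower side: watch the leader key; when the leader's lease expires,
  // etcd deletes the key and the DELETE event triggers a new election
  def watchLeader(client: Client, action: String)
                 (onLeaderGone: () => Unit): Watch.Watcher = {
    val leaderKey = ByteSequence.from(s"queue/$action/leader", UTF_8)
    val listener = Watch.listener { (response: WatchResponse) =>
      response.getEvents.forEach { event =>
        if (event.getEventType == WatchEvent.EventType.DELETE) onLeaderGone()
      }
    }
    client.getWatchClient.watch(leaderKey, listener)
  }
}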

One thing to note here is that there is only one QueueManager in each
scheduler. It holds all queues and delegates requests to the proper queue
in response to "fetch" requests.

In short, all endpoint data is stored in ETCD and renewed based on
keep-alives and leases.
Each component is designed to access ETCD when a failure is detected and
to connect to the new (failed-over) scheduler.

I hope this is useful to you.
I think that when my colleagues and I open PRs, we will need to add a more
detailed design along with them.

If you have any further questions, kindly let me know.

Thanks
Best regards
Dominic




Re: New architecture proposal

Posted by Mingyu Zhou <zh...@gmail.com>.
Dear Dominic,

Thanks for your proposal. It is very inspirational and it looks promising.

I'd like to ask some questions about the failover/failure recovery
mechanism of the scheduler component.

IIUC, a scheduler instance hosts multiple queue managers. If a scheduler is
down, we will lose multiple queue managers. Thus, there should be some form
of failure recovery of queue managers, which raises the following questions:

1. In your proposal, how is the failure of a scheduler detected? I.e.,
when a scheduler instance is down and some queue managers become
unreachable, which component will be aware of this unavailability and then
trigger the recovery procedure?

2. How should the failure be recovered and lost queue managers be brought
back to life? Specifically, in your proposal, you designed a hot
standby pairing of queue managers (one leader/two followers). Then how
should the new leader be selected in the face of a scheduler crash? And do
we need to allocate a new queue manager to maintain the
one-leader-two-follower configuration?

3. How will the other components in the system learn the new configuration
after a failover? For example, how will the pool balancer discover the new
state of the scheduler it manages and change its policy to distribute
queue creation requests?

Thanks
Mingyu Zhou



-- 
周明宇

Re: New architecture proposal

Posted by Dominic Kim <st...@gmail.com>.
Dear David, Matt, and Dascalita.
Thank you for your interest in my proposal.

Let me answer your questions one by one.

@David
Yes, I will (and actually already did) implement all components based on
SPIs.
The reason why I said "breaking changes" is that my proposal will affect
most components drastically.
For example, InvokerReactive will become an SPI and the current
InvokerReactive will become one of its concrete implementations.
My load balancer and throttler are also based on the current SPI.
So even though my implementation would be included in OpenWhisk,
downstreams can still take advantage of existing implementations such as
ShardingPoolBalancer.

Regarding Leader/Follower, a fair point.
The reason why I introduced such a model is to prepare for future
enhancements.
Actually, I reached the conclusion that memory-based activation passing
would be enough for OpenWhisk in terms of message persistence.
But that is just my own opinion, and the community may want a more rigid
level of persistence.
I naively thought we could add replication and HA logic in the queue,
similar to what Kafka does.
And Leader/Follower would be a good base building block for this.

Currently only the leader fetches activation messages from Kafka.
Followers stay idle while watching for a leadership change.
Once the leadership changes, one of the followers becomes the new leader,
and at that time a Kafka consumer for the new leader is created.
This is to minimize the failure-handling time from the clients'
perspective, as you mentioned. It is also correct that this flow does not
prevent activation messages from being lost on scheduler failure.
But it's not that complex, as activation messages are not replicated to
followers and the number of followers is configurable.
If we want, we can configure the number of required queues to 1, so there
will be only one leader queue.
(If we are OK with the current level of persistence, I think we may not
need more than one queue, as you said.)
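
As a side note, the takeover itself could be a single compare-and-put
transaction on ETCD, along these lines (only a sketch; the key name and
value are made up, and the jetcd calls just illustrate the idea):

import java.nio.charset.StandardCharsets.UTF_8
import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.op.{Cmp, CmpTarget, Op}
import io.etcd.jetcd.options.PutOption

object LeaderTakeoverSketch {
  // a follower tries to become the leader: the put succeeds only if the
  // leader key does not exist yet (version 0); binding the key to the
  // follower's lease makes it disappear again if the new leader crashes
  def tryTakeLeadership(client: Client, action: String,
                        me: String, leaseId: Long): Boolean = {
    val key = ByteSequence.from(s"queue/$action/leader", UTF_8)
    val put = Op.put(key, ByteSequence.from(me, UTF_8),
                     PutOption.newBuilder().withLeaseId(leaseId).build())
    client.getKVClient.txn()
      .If(new Cmp(key, Cmp.Op.EQUAL, CmpTarget.version(0)))
      .Then(put)
      .commit()
      .get()
      .isSucceeded
  }
}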

Regarding pulling activation messages, each action will have its own Kafka
topic.
This is the same as what I proposed in my previous proposals.
When an action is created, a Kafka topic for the action will be created.
So each leader queue (consumer) will fetch activation messages from its own
Kafka topic, and there will be no interference among actions.
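
As a sketch, the per-action topic creation could look like this (the
bootstrap address, naming scheme, and partition/replication settings are
illustrative only):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

object ActionTopicSketch {
  // create a dedicated topic for a newly created action; a single
  // partition is enough here since only the leader queue consumes from it
  def createActionTopic(namespace: String, action: String): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    val admin = AdminClient.create(props)
    try {
      val topic = new NewTopic(s"$namespace-$action", 1, 1.toShort)
      admin.createTopics(Collections.singleton(topic)).all().get()
    } finally admin.close()
  }
}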

When I and my colleagues open PRs for each component, we will add detailed
component designs.
That should help you all understand the proposal better.

@Matt
Thank you for the suggestion.
If I change the name now, it would break the link in this thread.
I will use the name you suggested when I open a PR.


@Dascalita

Interesting idea.
Do you have any GC patterns in mind that could apply to OpenWhisk?

In my proposal, the container GC is similar to what OpenWhisk does these
days.
Each container will autonomously fetch activations from the queue.
Whenever it finishes the invocation of one activation, it will fetch the
next request and invoke it.
In this sense, we can maximize container reuse.

When there are no more activation messages, the ContainerProxy will wait
for a given (configurable) time and just stop.
One difference is that containers are no longer paused; they are just
removed.
Instead of being paused, containers wait for subsequent requests for a
longer time (5~10s) than in the current implementation.
This is because pausing is also a relatively expensive operation in
comparison to a short-running invocation.
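
So the proxy's main loop would behave roughly like the following sketch
(the queue client, idle timeout, and helper functions are placeholders,
not the actual implementation):

import scala.annotation.tailrec
import scala.concurrent.duration._

object ProxyLoopSketch {
  final case class Activation(id: String)

  // hypothetical client that long-polls the action's queue for messages
  trait QueueClient {
    def fetch(waitAtMost: FiniteDuration): Option[Activation]
  }

  def invoke(a: Activation): Unit = println(s"invoking ${a.id}") // run the user function
  def deregisterFromEtcd(): Unit  = ()  // delete this container's key
  def stopContainer(): Unit       = ()  // remove the container, never pause it

  // keep fetching and invoking until the queue stays empty for the whole
  // idle timeout, then deregister and stop
  @tailrec
  def run(queue: QueueClient,
          idleTimeout: FiniteDuration = 10.seconds): Unit =
    queue.fetch(idleTimeout) match {
      case Some(activation) =>
        invoke(activation)
        run(queue, idleTimeout)
      case None =>
        deregisterFromEtcd()
        stopContainer()
    }
}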

Container lifecycle is managed in this way (a rough sketch of the
queue-side decision in step 3 follows the list).
1. When a container is created, it will add its information to ETCD.
2. A queue will count the existing number of containers using the above
information.
3. Under heavy loads, the queue will request more containers if the number
of existing containers is less than its resource limit.
4. When a container is removed, it will delete its information from ETCD.
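
All names and numbers in the sketch below are made up; it only shows the
shape of the step-3 decision:

object ScaleDecisionSketch {
  // hypothetical inputs: current queue depth, the live container count
  // read from ETCD, and the action's resource limit
  def containersToAdd(queueDepth: Int, existing: Long, limit: Int): Long = {
    val wanted = math.min(queueDepth.toLong, limit.toLong) // at most one per waiting activation
    math.max(0L, wanted - existing)                        // and never beyond the limit
  }

  // e.g. 8 waiting activations, 3 live containers, limit 5 => create 2 more
  def main(args: Array[String]): Unit =
    println(containersToAdd(queueDepth = 8, existing = 3L, limit = 5)) // prints 2
}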


Again, I really appreciate all your feedback and questions.
If you have any further questions, kindly let me know.

Best regards
Dominic




Re: New architecture proposal

Posted by Dascalita Dragos <dd...@gmail.com>.
Hi Dominic,
Thanks for sharing your ideas. IIUC (and pls keep me honest), the goal of
the new design is to improve activation performance. I personally love
this; performance is a critical non-functional feature of any FaaS system.

There’s something I’d like to call out: the management of containers in a
FaaS system could be compared to a JVM. A JVM allocates objects in memory
and GCs them. A FaaS system allocates containers to run actions and GCs
them when they become idle. If we could look at OW's scheduling from this
perspective, we could reuse the proven patterns in the JVM vs inventing
something new. I’d be interested in any GC implications in the new design,
meaning how idle actions get removed, and how that is orchestrated.

Thanks,
dragos



Re: New architecture proposal

Posted by Matt Sicker <bo...@gmail.com>.
Would it make sense to define an OpenWhisk Improvement/Enhancement
Proposal or something similar to what various other Apache projects do? We
could call them WHIPs or something. :)




-- 
Matt Sicker <bo...@gmail.com>

Re: New architecture proposal

Posted by David P Grove <gr...@us.ibm.com>.
Dominic Kim <st...@gmail.com> wrote on 04/04/2019 04:37:19 AM:
>
> I have proposed a new architecture.
> https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal
>
> It includes many controversial agendas and breaking changes.
> So I would like to form a general consensus on them.
>

Hi Dominic,

	There's much to like about the proposal.  Thank you for writing it
up.

	One meta-comment is that the work will have to be done in a way so
there are no actual "breaking changes".  It has to be possible to continue
to configure the system using the existing architectures while this work
proceeds.  I would expect this could be done via a new LoadBalancer and
some deployment options (similar to how Lean OpenWhisk was done).  If work
needs to be done to generalize the LoadBalancer SPI, that could be done
early in the process.

	On the proposal itself, I wonder if the complexity of Leader/Follower
is actually needed?  If a Scheduler crashes, it could be restarted and then
resume handling its assigned load.  I think there should be enough
information in etcd for it to recover its current set of assigned
ContainerProxys and carry on.   Activations in its in-memory queues would
be lost (bigger blast radius than the current architecture), but I don't
see that the Leader/Follower changes that (seems way too expensive to be
replicating every activation in the Follower Queues).   The Leader/Follower
would allow for shorter downtime for those actions assigned to the downed
Scheduler, but at the cost of significant complexity.  Is it worth it?

	Perhaps related to the Leader/Follower, it's not clear to me how
activation messages are being pulled from the action topic in Kafka during
the Queue creation window. I think they have to go somewhere (because
there is a mix of actions on a single Kafka topic and we can't stall other
actions while waiting for a Queue to be created for a new action), but if
you don't know yet which Scheduler is going to win the race to be a Leader,
how do you know where to put them?

--dave