Posted to dev@cassandra.apache.org by "benedict@apache.org" <be...@apache.org> on 2021/07/13 08:20:02 UTC

Re: [DISCUSS] CEP-10: Cluster and Code Simulations

Did anyone have any thoughts on this CEP, or shall I bring it forward for a vote also?

From: benedict@apache.org <be...@apache.org>
Date: Thursday, 3 June 2021 at 20:19
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: [DISCUSS] CEP-10: Cluster and Code Simulations
Proposal for a mechanism to evaluate whole clusters, or individual classes, with a deterministically pseudorandom ordering of all thread and message events.

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-10%3A+Cluster+and+Code+Simulations

Evaluating the correctness of distributed systems is hard, as I’m sure every developer on this list appreciates. As the project has matured, we have had to grapple more with the guarantees we provide users for the features we develop, and the semantics we promise, particularly around edge cases between two mechanisms or systems.

This work aims to dramatically reduce the project overhead necessary for delivering a bug-free Cassandra.

The premise is to intercept all relevant events that could be performed in a different order: primarily message delivery and thread events such as executor submission, signalling of threads, lock acquisition and release, and (to a lesser extent) even volatile reads and writes. These events are then scheduled pseudorandomly (with various restrictions to ensure a valid execution), or in some cases not evaluated at all (e.g. to simulate messages being lost). The result is a repeatable sequential evaluation of a multi-threaded, multi-actor system.
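
As a rough illustration of the idea (a minimal sketch only, not the simulator's actual API), intercepted events can be queued and then run one at a time in an order drawn from a seeded random number generator. The class name SeededScheduler, its methods and the drop probability are all hypothetical, and the sketch ignores the restrictions needed to guarantee a valid execution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of seed-driven event scheduling; not the CEP-10 API. */
final class SeededScheduler {
    private final Random random;
    private final List<Runnable> pending = new ArrayList<>();

    SeededScheduler(long seed) {
        this.random = new Random(seed);
    }

    /** Intercepted events are queued here instead of running immediately. */
    void submit(Runnable event) {
        pending.add(event);
    }

    /**
     * Runs queued events sequentially in an order chosen by the seeded
     * generator. Each event may instead be discarded, which stands in for
     * something like a lost message.
     */
    void runAll(double dropChance) {
        while (!pending.isEmpty()) {
            Runnable next = pending.remove(random.nextInt(pending.size()));
            if (random.nextDouble() < dropChance) {
                continue; // event is never evaluated
            }
            next.run();
        }
    }

    /** Records the order in which four illustrative events ran for a seed. */
    static List<String> trace(long seed) {
        SeededScheduler scheduler = new SeededScheduler(seed);
        List<String> log = new ArrayList<>();
        String[] names = {"message A->B", "message B->C", "lock acquire", "executor task"};
        for (String name : names) {
            scheduler.submit(() -> log.add(name));
        }
        scheduler.runAll(0.1);
        return log;
    }
}
```

Calling SeededScheduler.trace with the same seed twice yields the same sequence, which is the property that makes a failing run reproducible; different seeds will generally explore different orderings.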

This lets us evaluate a much broader range of cluster behaviours without any additional development work, so we can implement a broad range of property-based and related randomized acceptance tests without significant developer burden.

The work will apply just as readily to multi-threaded single classes as it will to whole clusters, and will come with a linearizability test for LWTs as well as a unit test for an existing multi-threaded bug that is otherwise hard to exhibit.

To achieve this, significant modifications will be required to the codebase, mostly cleaning up existing abstractions. Specifically, we will need to be able to mock executors, any blocking concurrency primitives, time, filesystem access and internode streaming.
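
To make the mocking requirement concrete, here is a minimal sketch of what abstracting time might look like. The names SimulatedTime, Clock, ManualClock and isExpired are assumptions made up for this example and do not come from the Cassandra codebase. Code under test reads time only through the injected interface, so a simulation can advance it explicitly instead of depending on the wall clock.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a mockable time source; names are illustrative only. */
final class SimulatedTime {
    /** Code under test asks this for the time instead of calling System.nanoTime(). */
    interface Clock {
        long nanoTime();
    }

    /** Production binding: delegates to the real monotonic clock. */
    static final Clock SYSTEM = System::nanoTime;

    /** Simulation binding: time moves only when the test advances it. */
    static final class ManualClock implements Clock {
        private long now;

        @Override
        public long nanoTime() {
            return now;
        }

        void advance(long duration, TimeUnit unit) {
            now += unit.toNanos(duration);
        }
    }

    /** Example of code under test: a timeout check that depends only on the injected clock. */
    static boolean isExpired(Clock clock, long startNanos, long timeoutNanos) {
        return clock.nanoTime() - startNanos >= timeoutNanos;
    }
}
```

With a ManualClock, a timeout path can be exercised by advancing the clock past the deadline, with no sleeping and no dependence on machine speed. The same injection pattern would presumably apply to executors, the filesystem and streaming.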

The work is – in large part – already complete, with JIRA and PRs to follow in the coming weeks. Of course, the work is subject to the usual community input and review, so this does not preclude changes to the work (even significant ones, if they are warranted). I know a lot of incoming CEPs are likely to be backed up by significant off-list development as a result of the focus on a shippable 4.0. Hopefully this is just a temporary growing pain, particularly as we move towards a shippable trunk.

I hope this work will be of huge value to the project, particularly as we race to catch up on years of limited feature development.

JIRA and PRs will follow, but I wanted to kick off discussion in advance.

Re: [DISCUSS] CEP-10: Cluster and Code Simulations

Posted by "benedict@apache.org" <be...@apache.org>.
Hi Benjamin,

The concurrency constructs listed are all _blocking_ concurrency primitives, i.e. they put threads to sleep and wake them up. Since the goal of this work is pseudorandom execution of the application, trapping thread events is a central feature.

The ability to mock the file system is only to ensure the execution is _deterministic_. Otherwise a cluster running billions of simulations would be almost useless, as you would not readily be able to reproduce the sequence on a local machine. The execution order is extremely brittle, with even a different patch release of the JVM being able to produce a different sequence of execution (in some cases, at least – no doubt many patch releases do not have ordering impacts).

The best example of this work is the LWT linearizability verifier that will be included with it, which is quite a simple test to put together with the simulator: you simply issue some LWT reads and writes to a cluster, and the simulator intercepts every message and thread event (and, in some specific relevant cases, memory access event) and executes them in pseudorandom order. Each run exhibits unique behaviour, exploring different edge cases in the system. If we were to intercept only message events, we would fail to explore a wide variety of potentially erroneous states in the system, including even those related only to message delivery (in the real world, for instance, responses can be received before the thread sending them completes the act of doing so).
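
For readers unfamiliar with what a linearizability check involves, the following is a textbook brute-force sketch for a single read/write register. It is not the verifier described above and does not model LWT compare-and-set semantics; the class LinearizabilityCheck, the Op type and the logical timestamps are all assumptions made for illustration. A history is accepted if some total order of its operations respects real-time order and every read returns the most recently written value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical brute-force linearizability check for one read/write register. */
final class LinearizabilityCheck {
    /** One completed operation; invoked and responded are logical timestamps. */
    static final class Op {
        final boolean isWrite;
        final int value; // value written, or value the read returned
        final long invoked;
        final long responded;

        Op(boolean isWrite, int value, long invoked, long responded) {
            this.isWrite = isWrite;
            this.value = value;
            this.invoked = invoked;
            this.responded = responded;
        }
    }

    static boolean isLinearizable(List<Op> history, int initialValue) {
        return search(new ArrayList<>(history), initialValue);
    }

    /** Tries every operation that may legally come next, backtracking on failure. */
    private static boolean search(List<Op> remaining, int current) {
        if (remaining.isEmpty()) {
            return true;
        }
        for (int i = 0; i < remaining.size(); i++) {
            Op candidate = remaining.get(i);
            if (!mayGoNext(candidate, remaining)) {
                continue;
            }
            if (!candidate.isWrite && candidate.value != current) {
                continue; // a read must observe the latest write in this order
            }
            List<Op> rest = new ArrayList<>(remaining);
            rest.remove(i);
            if (search(rest, candidate.isWrite ? candidate.value : current)) {
                return true;
            }
        }
        return false;
    }

    /** An op may go next only if no other remaining op responded before it was invoked. */
    private static boolean mayGoNext(Op candidate, List<Op> remaining) {
        for (Op other : remaining) {
            if (other != candidate && other.responded < candidate.invoked) {
                return false;
            }
        }
        return true;
    }
}
```

The search is exponential in the worst case, so it only suits short histories. Combined with a simulator that generates many distinct interleavings, a check of this kind runs against each recorded history to flag any run in which a stale or impossible value was observed.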


From: Benjamin Lerer <bl...@apache.org>
Date: Tuesday, 13 July 2021 at 09:50
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
Hi Benedict, Sam,

Could you describe some of the scenarios that this new framework will allow
us to test? They might help me to understand the changes required.
The need for the changes around concurrency and file access is not obvious
to me. Consequently, I am guessing that I probably do not fully
understand the goal of the proposal.

Thanks in advance

Benjamin



Re: [DISCUSS] CEP-10: Cluster and Code Simulations

Posted by Benjamin Lerer <bl...@apache.org>.
Hi Benedict, Sam,

Could you describe some of the scenarios that this new framework will allow
us to test? They might help me to understand the changes required.
The need for the changes around concurrency and file access is not obvious
to me. Consequently, I am guessing that I probably do not fully
understand the goal of the proposal.

Thanks in advance

Benjamin


On Tue, 13 Jul 2021 at 10:37, Sam Tunnicliffe <sa...@beobal.com> wrote:

> Spoiler alert: I am pretty familiar with the proposal and the off-list
> work that has been done toward it.
>
> From my perspective, I have no qualms about putting this CEP up for a
> vote. Having seen the potential (and to some degree, realised) benefit of
> this proposal, I am convinced of its value.
>
> Thanks,
> Sam

Re: [DISCUSS] CEP-10: Cluster and Code Simulations

Posted by Sam Tunnicliffe <sa...@beobal.com>.
Spoiler alert: I am pretty familiar with the proposal and the off-list work that has been done toward it.

From my perspective, I have no qualms about putting this CEP up for a vote. Having seen the potential (and to some degree, realised) benefit of this proposal, I am convinced of its value.

Thanks,
Sam



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org