Posted to dev@flink.apache.org by Chesnay Schepler <ch...@apache.org> on 2022/02/09 11:40:37 UTC

[DISCUSS] Drop Jepsen tests

For a few years now we have had a set of Jepsen tests that verify the
correctness of Flink's coordination layer in the face of process crashes.
They have found real issues in the past and thus provided value to the
project, and the core idea behind them (and behind Jepsen in general)
is very sound.

However, so far we have neither made attempts to use Jepsen more broadly
(we limited ourselves to very basic tests) nor familiarized ourselves
with the tests or with Jepsen itself.
As a result these tests are difficult to maintain. They (and Jepsen) are
written in Clojure, which makes debugging, changes, and upstreaming
contributions very difficult.
Additionally, the tests rely on a rather complicated (Ververica-internal)
Terraform+Ansible setup to spin up and tear down AWS machines. While it
works (and is actually pretty cool), it's difficult to adjust because the
people who wrote it have left the company.

The reason I'm raising this now (and not earlier) is that so far keeping
the tests running wasn't much of a problem: bump a few dependencies here
and there and we're good to go.

That has changed with the recent upgrade to ZooKeeper 3.5, which isn't
supported by Jepsen out of the box and completely breaks the tests. We
would now have to write a new ZooKeeper 3.5+ integration for Jepsen
(again, in Clojure). While I started working on that and could likely
finish it, I began to wonder whether it even makes sense to do so, and
whether we couldn't invest this time elsewhere.

Let me know what you think.


Re: [DISCUSS] Drop Jepsen tests

Posted by Yang Wang <da...@gmail.com>.
@Austin
We already have some e2e tests[1] that guard k8s deployments (both session
and application mode, with or without HA).
And I agree with you that network partitions could be simulated with K8s
network policies.


[1].
https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/test_kubernetes_application_ha.sh
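
A rough sketch of how such a fault-injection step could be bolted onto
these e2e tests, assuming the Flink pods carry the app/component labels
set by the native Kubernetes integration (the cluster id, namespace and
sleep below are made up for illustration):

#!/usr/bin/env bash
# Sketch only: fully isolate the TaskManager pods with a deny-all
# NetworkPolicy, wait for failover to kick in, then heal the partition so
# the existing e2e assertions can verify recovery.
set -euo pipefail

CLUSTER_ID=${CLUSTER_ID:-flink-ha-test}   # assumed cluster id / app label
NAMESPACE=${NAMESPACE:-default}

kubectl -n "$NAMESPACE" apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-taskmanagers
spec:
  podSelector:
    matchLabels:
      app: ${CLUSTER_ID}
      component: taskmanager
  policyTypes: ["Ingress", "Egress"]   # no rules listed => deny all traffic
EOF

sleep 60   # give the job time to fail and the JobManager time to react

kubectl -n "$NAMESPACE" delete networkpolicy isolate-taskmanagers

Note that this only has an effect if the cluster's CNI plugin actually
enforces NetworkPolicies; the default CNI in many local test setups does
not.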

Best,
Yang


Re: [DISCUSS] Drop Jepsen tests

Posted by Austin Cawley-Edwards <au...@gmail.com>.
Are there e2e tests that run on Kubernetes? Perhaps k8s network policies[1]
would be an option to simulate asymmetric network partitions in a more
approachable way, without modifying iptables?

Austin

[1]:
https://kubernetes.io/docs/concepts/services-networking/network-policies/
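
As an illustration of the asymmetric case, a policy roughly like the one
below (component labels are an assumption about the deployment under test;
sketch only) keeps JobManager/TaskManager communication working while
cutting TaskManager-to-TaskManager traffic:

# Sketch: TaskManager pods accept ingress only from the JobManager pod, so
# JM <-> TM communication keeps working while TM <-> TM traffic is dropped.
# The component labels are assumptions about the deployment being tested.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: partition-taskmanagers
spec:
  podSelector:
    matchLabels:
      component: taskmanager
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              component: jobmanager
EOF

# Heal the partition again afterwards:
# kubectl delete networkpolicy partition-taskmanagers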



Re: [DISCUSS] Drop Jepsen tests

Posted by David Morávek <dm...@apache.org>.
Network partitions are trickier than simply crashing a process. For example,
they can be asymmetric: as a TM you're still able to talk to the JM, but
you're not able to talk to other TMs.

In general this could be achieved by manipulating iptables on the host
machine (considering we spawn all the processes locally), but I'm not sure
that would solve the "make it less complicated for others to contribute"
part :/ Also, this kind of test would only be executable on *nix systems.

I assume that Jepsen uses the same approach under the hood.

D.
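
For illustration, assuming each Flink process runs in its own container or
network namespace with its own IP (the addresses below are placeholders),
the TM/TM partition could be injected with a couple of iptables rules run
inside each TaskManager's namespace; with everything on a single localhost
one would have to match on the TM rpc/data ports instead:

# Sketch: run inside each TaskManager's network namespace (e.g. via
# "docker exec" or "ip netns exec"). Drops traffic from every *other* TM
# while the JobManager stays reachable. IPs are placeholders; Linux only.
OTHER_TMS="10.0.0.12 10.0.0.13"   # all TaskManager IPs except the local one

for ip in $OTHER_TMS; do          # inject the partition
  iptables -A INPUT -s "$ip" -j DROP
done

# ...observe the failover behaviour, then heal the partition again:
for ip in $OTHER_TMS; do
  iptables -D INPUT -s "$ip" -j DROP
done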


Re: [DISCUSS] Drop Jepsen tests

Posted by Chesnay Schepler <ch...@apache.org>.
b) and c) are part of the same test:

  * We have a job running,
  * trigger a network partition (failing the job),
  * then crash HDFS (preventing checkpoints and access to the HA
    storageDir),
  * then the partition is resolved and HDFS is started again.

Conceptually I would think we can replicate this by nuking half the 
cluster, crashing HDFS/ZK, and restarting everything.
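
One way that fault schedule could be scripted on a Kubernetes-based setup
(the policy file, the hdfs-namenode deployment name and the forwarded REST
port are all assumptions made for the sake of the sketch, not something
that exists in the current e2e scripts):

# Sketch of the schedule above; every name used here is an assumption.
set -euo pipefail
REST=http://localhost:8081                            # JM REST port, forwarded

kubectl apply -f partition-taskmanagers.yaml          # partition -> job fails
kubectl scale deployment hdfs-namenode --replicas=0   # crash HDFS
sleep 120                                             # checkpoints/HA access now fail

kubectl delete networkpolicy partition-taskmanagers   # heal the partition
kubectl scale deployment hdfs-namenode --replicas=1   # bring HDFS back

until curl -s "$REST/jobs/overview" | grep -q '"state":"RUNNING"'; do
  sleep 5                                             # wait for the job to recover
done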


Re: [DISCUSS] Drop Jepsen tests

Posted by Chesnay Schepler <ch...@apache.org>.
The Jepsen tests cover 3 cases:
a) JM/TM crashes
b) HDFS namenode crash (i.e., we can't checkpoint because HDFS is down)
c) network partitions

a) can be (and probably is) reasonably covered by existing ITCases and e2e
tests
b) We could probably figure this out ourselves if we wanted to.
c) is the difficult part.

Note that the tests also only cover YARN (per-job/session) and
standalone (session) deployments.
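
For b), a rough sketch of the kind of home-grown check we could build: stop
the namenode while a job is checkpointing, confirm that checkpoints start
failing, bring it back and confirm they complete again (Hadoop 3 daemon
commands, the REST address and the job id are assumptions/placeholders):

# Sketch for case b): checkpoints must fail while the namenode is down and
# succeed again afterwards. Assumes a local Hadoop 3 install and the
# JobManager REST API on localhost:8081; the job id is passed in.
set -euo pipefail
JOB_ID=${1:?usage: $0 <job-id>}
CP="http://localhost:8081/jobs/${JOB_ID}/checkpoints"

failed_before=$(curl -s "$CP" | jq '.counts.failed')

hdfs --daemon stop namenode      # checkpoints and the HA storageDir now fail
sleep 60
failed_during=$(curl -s "$CP" | jq '.counts.failed')
[ "$failed_during" -gt "$failed_before" ]        # sanity check: failures observed

hdfs --daemon start namenode     # recovery: checkpointing should resume
sleep 60
curl -s "$CP" | jq -r '.latest.completed.status' # expect COMPLETED again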



Re: [DISCUSS] Drop Jepsen tests

Posted by Konstantin Knauf <kn...@apache.org>.
Thank you for raising this issue. What risks do you see if we drop it? Do
you see any cheaper alternative to (partially) mitigate those risks?


-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk