Posted to dev@flink.apache.org by Robert Metzger <rm...@apache.org> on 2020/05/14 06:38:30 UTC

[NOTICE] Azure Pipelines Status

Hi all,

tl;dr: I will have to cancel some E2E test executions of pull requests
because we have reached the capacity limit of Flink's Azure Pipelines
account.

Long version: We have two types of agent pools in Azure Pipelines:
Microsoft-hosted VMs and an Alibaba-hosted Docker environment.
We run the E2E tests on the Microsoft VMs because each VM is destroyed
after every execution (the E2E tests often leave dangling Docker
containers, processes, etc., and they modify files in system directories).
In the Alibaba-hosted Docker environment, we compile the code and run the
regular Maven tests.

We only have 10 Microsoft-hosted VMs available, and each E2E execution
takes around 3.5 hours. That gives us a capacity of ~70 E2E executions
per day (10 VMs * 24 h / 3.5 h = ~68).
On Tuesday, we had 110 builds, on Wednesday 98 builds.
Because of this, I will (manually) cancel some E2E test executions for pull
requests. If I see that a PR explicitly changes something in the E2E tests,
I will keep its execution. If I see that a PR is a docs change, has other
test failures, etc., I will cancel the E2E execution.
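
As a rough sanity check of the capacity figure above, the arithmetic can be
sketched as follows. This is only a back-of-the-envelope model: it assumes
every VM is busy around the clock and that each build triggers exactly one
E2E execution, neither of which is stated in this thread. The function name
is made up for illustration.

```python
def daily_e2e_capacity(vms: int, hours_per_run: float) -> float:
    """Upper bound on E2E executions per day, assuming VMs are never idle."""
    return vms * 24 / hours_per_run


# Figures from this mail: 10 VMs, roughly 3.5 hours per E2E execution.
capacity = daily_e2e_capacity(vms=10, hours_per_run=3.5)
print(f"capacity: ~{capacity:.0f} executions/day")

# Builds reported for Tuesday and Wednesday, compared against that bound.
# Assumption: one E2E execution per build.
for day, builds in [("Tuesday", 110), ("Wednesday", 98)]:
    print(f"{day}: {builds} builds, ~{builds - capacity:.0f} over capacity")
```

Under the same assumptions, any day with more builds than this bound leaves
a backlog that carries over to the next day, which is why some executions
have to be cancelled rather than merely delayed.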

If you want to verify that the E2E tests pass for your own changes, you
can set up Azure Pipelines for your GitHub account; it's free and works
quite well. Here's a tutorial:
https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-Tutorial:SettingupAzurePipelinesforaforkoftheFlinkrepository

What can we do to avoid this situation in the future?
Sadly, Microsoft does not allow open source projects to buy additional
processing slots [1]. However, I'm in touch with a product manager at
Microsoft who promised me (yesterday) to increase the limit for us.

In the Alibaba environment, we have 80 slots available, and usually no
capacity constraints. This means we don't need to make compromises there.

Sorry for this inconvenience.

Best,
Robert

PS: I'm considering keeping this thread as a permanent "status update"
thread for Azure Pipelines.

[1]
https://developercommunity.visualstudio.com/content/problem/1028884/additionally-purchased-microsoft-hosted-build-agen.html

Re: [NOTICE] Azure Pipelines Status

Posted by Robert Metzger <rm...@apache.org>.

Microsoft has now doubled our CI capacity (to 20 concurrent VMs for
executing e2e tests).
If the e2e test execution has normalized by tomorrow, I will revert the
hotfix and enable e2e tests on PRs again.

Sorry for the back and forth.

On Tue, May 19, 2020 at 3:11 PM Robert Metzger <rm...@apache.org> wrote:

> Microsoft has not increased our capacity yet (even though it was promised
> to me yesterday again).
>
> I have now merged a hotfix disabling the e2e test execution on pull
> requests to have enough capacity on master.
> Please run e2e tests using your private Azure accounts. Thanks for your
> understanding!
>
> Best,
> Robert

Re: [NOTICE] Azure Pipelines Status

Posted by Robert Metzger <rm...@apache.org>.

Microsoft has not increased our capacity yet (even though it was promised
to me again yesterday).

I have now merged a hotfix that disables e2e test execution on pull
requests so that we have enough capacity for master.
Please run e2e tests using your private Azure accounts. Thanks for your
understanding!

Best,
Robert


On Thu, May 14, 2020 at 11:23 AM Robert Metzger <rm...@apache.org> wrote:

> Roughly speaking, I see the following problematic areas (I have initially
> tried running the E2E tests on those machines):
>
> a) e2e tests starting Docker images (including Kubernetes). Since the
> tests on the Ali infra are running in docker themselves, we need to adjust
> the test scripts (which is not trivial, because both containers need to be
> in the same network, and the volume mount paths are different)
>
> b) tests that modify the underlying file system: common_kubernetes.sh
> installs stuff in "/usr/local/bin/". (Now that I think about it, it's not a
> problem in the docker environment).
>
> c) Tests that don't clean up properly when failing. IIRC I saw leftover
> docker containers by test_streaming_kinesis.sh when I was trying to run the
> E2E tests on the Ali machines.
>
> And then there are pull requests that propose changes to the e2e scripts
> that mess something up :)
> We certainly need to isolate the e2e test execution somehow. Maybe we
> could launch VMs on the Ali machines for running the E2Es? (Using Vagrant)
>
> If Microsoft is not going to provide us with more test capacity, I will
> evaluate other options for the E2E tests.

Re: [NOTICE] Azure Pipelines Status

Posted by Robert Metzger <rm...@apache.org>.

Roughly speaking, I see the following problematic areas (I initially tried
running the E2E tests on the Ali machines):

a) e2e tests that start Docker containers (including Kubernetes). Since the
tests on the Ali infra themselves run in Docker, we need to adjust the
test scripts (which is not trivial, because both containers need to be in
the same network, and the volume mount paths are different).

b) Tests that modify the underlying file system: common_kubernetes.sh
installs stuff in "/usr/local/bin/". (Now that I think about it, that's not
a problem in the Docker environment.)

c) Tests that don't clean up properly when failing. IIRC I saw Docker
containers left behind by test_streaming_kinesis.sh when I was trying to
run the E2E tests on the Ali machines.
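
To illustrate what "cleaning up properly when failing" in c) could look
like, here is a minimal sketch. It is not taken from the Flink test
scripts (those are shell scripts, where a `trap` handler would play the
same role); it is written in Python purely for illustration. It assumes
that the test starts its containers with a `--label` of its own; the
label value and the helper name `docker_cleanup` are hypothetical.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def docker_cleanup(label, run=subprocess.run):
    """Remove all containers carrying `label`, even if the test body fails.

    `run` is injectable so the logic can be exercised without Docker.
    """
    try:
        yield
    finally:
        # List the IDs of all containers (running or stopped) with the label.
        listing = run(
            ["docker", "ps", "-aq", "--filter", f"label={label}"],
            capture_output=True, text=True, check=False,
        )
        ids = listing.stdout.split()
        if ids:
            # Force-remove them so nothing is left behind on the host.
            run(["docker", "rm", "-f", *ids], check=False)


if __name__ == "__main__":
    # Hypothetical usage: the test body would start containers with
    # `docker run --label flink-e2e=kinesis ...` inside the block.
    with docker_cleanup("flink-e2e=kinesis"):
        pass
```

Because the removal happens in the `finally` branch, it runs on both
success and failure of the test body. It does not cover a test process
that is killed outright, so some isolation of the host would still be
needed.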

And then there are pull requests that propose changes to the e2e scripts
that mess something up :)
We certainly need to isolate the e2e test execution somehow. Maybe we could
launch VMs on the Ali machines for running the E2Es? (Using Vagrant)

If Microsoft is not going to provide us with more test capacity, I will
evaluate other options for the E2E tests.

On Thu, May 14, 2020 at 10:36 AM Till Rohrmann <tr...@apache.org> wrote:

> Thanks for the update Robert.
>
> One idea to make the e2e also run on the Alibaba infrastructure would be to
> ensure that e2e tests clean up after they have run. Do we know which e2e
> tests don't do this properly?
>
> Cheers,
> Till

Re: [NOTICE] Azure Pipelines Status

Posted by Till Rohrmann <tr...@apache.org>.

Thanks for the update, Robert.

One idea to make the e2e tests also run on the Alibaba infrastructure would be to
ensure that e2e tests clean up after they have run. Do we know which e2e
tests don't do this properly?

Cheers,
Till
