Posted to dev@beam.apache.org by Rafael Fernandez <rf...@google.com> on 2018/06/30 16:32:11 UTC

Parallelizing test runs

+Reuven Lax <re...@google.com> told me yesterday that he was waiting for
some test to be scheduled and run, and it took 6 hours or so. I would like
to help reduce these wait times by increasing parallelism. I need help
understanding the continuous minimum of what we use. It seems the following
is true:


   - There seem to always be 16 Jenkins machines on (16 CPUs each)
   - There seem to be three GKE machines always on (1 CPU each)
   - Most (if not all) unit tests run on 1 machine, and seem to run
   one-at-a-time <-- I think we can safely parallelize this to 20.

With current quotas, if we parallelize to 20 concurrent unit tests, we
still have room for 80 other concurrent dataflow jobs to execute, with 75%
of CPU capacity.

Thoughts? Additional data?

Thanks,
r

Re: Parallelizing test runs

Posted by Mikhail Gryzykhin <mi...@google.com>.
I don't see a difference at first glance, and no difference is expected.

We never utilized concurrent jobs originally, because the job took ~1 hour and
was triggered once every 6 hours. At some point, I added triggering of the job
when a new commit is available, and this started triggering jobs in parallel
for each commit. That is unnecessary overhead for post-commits. Removing
concurrent job runs for post-commits means a single job is triggered for the
multiple commits that accumulated during execution of the previous job.

I believe you are talking about triggering test cases concurrently within a
single Jenkins job. That was not changed.
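
To illustrate the coalescing effect described above, here is a minimal Python
sketch. It is not taken from the Beam Jenkins configuration. It assumes a
simplified model of a job with concurrency disabled: a commit that arrives
while a run is in progress queues at most one follow-up run, and later commits
join that queued run. The name `coalesced_runs`, the commit times, and the
60-minute duration are hypothetical.

```python
def coalesced_runs(commit_times, job_duration):
    """Return the start times of runs when concurrent runs are disabled.

    Assumed model (hypothetical, for illustration only):
    - a commit starts a run immediately if nothing is running;
    - otherwise it queues one run for when the current run finishes;
    - commits arriving before a queued run starts are covered by that run.
    """
    starts = []
    free_at = float("-inf")  # time at which the latest scheduled run finishes
    for t in sorted(commit_times):
        if starts and t <= starts[-1]:
            # The most recently scheduled run has not started yet,
            # so this commit is picked up by it.
            continue
        start = max(t, free_at)
        starts.append(start)
        free_at = start + job_duration
    return starts


# Six commits (minutes since some origin) and a run that takes 60 minutes.
commits = [0, 10, 20, 30, 70, 200]
print(len(commits))                 # runs with one concurrent run per commit
print(coalesced_runs(commits, 60))  # [0, 60, 120, 200]
```

Under these assumptions six commits cost four runs instead of six, and the
commits at minutes 10, 20, and 30 share a single run.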

--Mikhail

Have feedback <http://go/migryz-feedback>?


On Mon, Aug 6, 2018 at 2:44 PM Lukasz Cwik <lc...@google.com> wrote:

> How much slower did the post commits become after removing concurrency?
>
> On Thu, Aug 2, 2018 at 2:32 PM Mikhail Gryzykhin <mi...@google.com>
> wrote:
>
>> I've disabled concurrency for auto-triggered post-commits job. That
>> should reduce job scheduling considerably.
>>
>> I believe that this change should resolve quota issue we have seen this
>> time. I'll monitor if problem reappears.
>>
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Wed, Aug 1, 2018 at 9:40 AM Pablo Estrada <pa...@google.com> wrote:
>>
>>> It feels to me like a peak of 60 jobs per minute is pretty high. If I
>>> understand correctly, we run up to 20 dataflow jobs in parallel per test
>>> suite? Or what's the number here?
>>>
>>> It is also true that most of our tests are simple NeedsRunner tests, that
>>> test a couple elements, so the whole pipeline overhead is on startup. This
>>> may be improved by lumping tests together (though might we lose
>>> debuggability?).  Our average number of jobs is, I hope, muuuch smaller
>>> than 60 per minute...
>>>
>>> With all these considerations, I would lean more towards having a retry
>>> policy as the immediate solution.
>>> -P.
>>>
>>> On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <ap...@google.com>
>>> wrote:
>>>
>>>> I like 1 and 2. How do credentials get into Jenkins? Could we create a
>>>> user per Jenkins host?
>>>>
>>>> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> There was also a proposal to lump multiple tests into a single
>>>>> Dataflow job instead of spinning up a separate Dataflow job for each test.
>>>>>
>>>>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mi...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I synced with Rafael. Below is summary of discussion.
>>>>>>
>>>>>> This quota is CreateRequestsPerMinutePerUser and it has 60 requests
>>>>>> per user by default.
>>>>>>
>>>>>> I've created Jira [BEAM-5053](
>>>>>> https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>>>>>
>>>>>> I see following options we can utilize:
>>>>>> 1. Add retry logic. Although this limits us to 1 dataflow job start
>>>>>> per second for whole Jenkins. In long scale this can also block one test
>>>>>> job if other jobs take all the slots.
>>>>>> 2. Utilize different users to spin Dataflow jobs.
>>>>>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>>>>>> limits the value to 60 requests per minute.
>>>>>> 4. Long run generic suggestion: limit amount of dataflow jobs we spin
>>>>>> up and move tests to the form of unit or component tests.
>>>>>>
>>>>>> Please, fill in any insights or ideas you have on this.
>>>>>>
>>>>>> Regards,
>>>>>> --Mikhail
>>>>>>
>>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mi...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> Seems that we hit quota issue again:
>>>>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>>>>
>>>>>>> Can someone share information on how was this triaged last time or
>>>>>>> guide me on possible follow-up actions?
>>>>>>>
>>>>>>> Regards,
>>>>>>> --Mikhail
>>>>>>>
>>>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rf...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Summary for all folks following this story -- and many thanks for
>>>>>>>> explaining configs to me and pointing me to files and such.
>>>>>>>>
>>>>>>>> - Scott made changes to the config and we can now run 3
>>>>>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>>>>> concurrent Dataflow jobs when running those
>>>>>>>> - I've been keeping an eye on quota peaks for all resources today
>>>>>>>> and have not seen any worrisome limits overall.
>>>>>>>> - Also note there are improvements planned to the
>>>>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> r
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <
>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>
>>>>>>>>> Done!
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota
>>>>>>>>>> [1]. Can you take a look? I've filed [BEAM-4722]:
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <
>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>>>>>> Quotas should not be a problem, if they are, please file a JIRA under
>>>>>>>>>>> gcp-quota.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> r
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>>>>> your results. Though if all you are sharing is "they passed" then I guess
>>>>>>>>>>>> we don't have to insist on evidence.
>>>>>>>>>>>>
>>>>>>>>>>>> Kenn
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> A few thoughts:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * The Jenkins job getting backed up
>>>>>>>>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>>>>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>>>>>>>>> is idle more often than backlogged.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>>>>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>>>>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>>>>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>>>>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects of a
>>>>>>>>>>>>> runner. It would be more efficient to run locally only the tests affected
>>>>>>>>>>>>> by your change. Note that this requires having access to a GCP project with
>>>>>>>>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>>>>>>>>> The command for this is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The validates runner test parallelism is controlled here and
>>>>>>>>>>>>>> is currently set to be "unlimited":
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Each test fork is run on a different gradle worker, so the
>>>>>>>>>>>>>> number of parallel test runs is limited to the max number of workers
>>>>>>>>>>>>>> configured which is controlled here:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner
>>>>>>>>>>>>>>>> tests. We currently allow only one of these to run at a time, to control
>>>>>>>>>>>>>>>> usage of Dataflow and of GCE quota. Other types of tests do not suffer from
>>>>>>>>>>>>>>>> this issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow
>>>>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8 hours end to
>>>>>>>>>>>>>>>> end to run these tests (about 6 hours for the run to be scheduled). If
>>>>>>>>>>>>>>>> there was a failure, I would have had to repeat the whole process. In the
>>>>>>>>>>>>>>>> worst case, this process could have taken me days. While this is not as
>>>>>>>>>>>>>>>> pressing as some other issues (as most people don't need to run the
>>>>>>>>>>>>>>>> Dataflow tests on every PR), fixing it would make such changes much easier
>>>>>>>>>>>>>>>> to manage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he
>>>>>>>>>>>>>>>>> was waiting for some test to be scheduled and run, and it took 6 hours or
>>>>>>>>>>>>>>>>> so. I would like to help reduce these wait times by increasing parallelism.
>>>>>>>>>>>>>>>>> I need help understanding the continuous minimum of what we use. It seems
>>>>>>>>>>>>>>>>> the following is true:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16
>>>>>>>>>>>>>>>>>    CPUs each)
>>>>>>>>>>>>>>>>>    - There seems to be three GKE machines always on (1
>>>>>>>>>>>>>>>>>    CPU each)
>>>>>>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and
>>>>>>>>>>>>>>>>>    seem to run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent
>>>>>>>>>>>>>>>>> unit tests, we still have room for 80 other concurrent dataflow jobs to
>>>>>>>>>>>>>>>>> execute, with 75% of CPU capacity.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
>>>
>>

Re: Parallelizing test runs

Posted by Lukasz Cwik <lc...@google.com>.
How much slower did the post commits become after removing concurrency?


Re: Parallelizing test runs

Posted by Mikhail Gryzykhin <mi...@google.com>.
I've disabled concurrency for the auto-triggered post-commit jobs. That should
reduce job scheduling considerably.

I believe that this change should resolve the quota issue we have seen this
time. I'll monitor whether the problem reappears.
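
The quoted thread below lists retry logic as one option for the
CreateRequestsPerMinutePerUser quota (60 requests per minute per user by
default). A minimal Python sketch of that idea follows, assuming exponential
backoff with jitter. `QuotaExceeded`, `submit_with_retry`, and the `submit`
callable are hypothetical names for illustration; they are not Beam or
Dataflow APIs.

```python
import random
import time


class QuotaExceeded(Exception):
    """Hypothetical stand-in for a 'create requests quota exceeded' error."""


def submit_with_retry(submit, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call submit(), retrying with exponential backoff on QuotaExceeded.

    The delay before retry k (0-based) is drawn from
    [0, min(max_delay, base_delay * 2**k)), so that callers which hit the
    limit at the same moment do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except QuotaExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())
```

`sleep` and `rng` are parameters only so the backoff can be exercised without
real waiting. As noted in the thread, retries smooth out bursts but do not
raise the ceiling: at 60 requests per minute, job starts across all of Jenkins
still average at most one per second.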

--Mikhail

Have feedback <http://go/migryz-feedback>?


On Wed, Aug 1, 2018 at 9:40 AM Pablo Estrada <pa...@google.com> wrote:

> It feels to me like a peak of 60 jobs per minute is pretty high. If I
> understand correctly, we run up to 20 dataflow jobs in parallel per test
> suite? Or what's the number here?
>
> It is also true that most of our tests are simple NeedsRunner tests, that
> test a couple elements, so the whole pipeline overhead is on startup. This
> may be improved by lumping tests together (though might we lose
> debuggability?).  Our average number of jobs is, I hope, muuuch smaller
> than 60 per minute...
>
> With all these considerations, I would lean more towards having a retry
> policy as the immediate solution.
> -P.
>
> On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <ap...@google.com> wrote:
>
>> I like 1 and 2. How do credentials get into Jenkins? Could we create a
>> user per Jenkins host?
>>
>> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>>
>>> There was also a proposal to lump multiple tests into a single Dataflow
>>> job instead of spinning up a separate Dataflow job for each test.
>>>
>>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mi...@google.com>
>>> wrote:
>>>
>>>> I synced with Rafael. Below is summary of discussion.
>>>>
>>>> This quota is CreateRequestsPerMinutePerUser and it has 60 requests per
>>>> user by default.
>>>>
>>>> I've created Jira [BEAM-5053](
>>>> https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>>>
>>>> I see following options we can utilize:
>>>> 1. Add retry logic. Although this limits us to 1 dataflow job start per
>>>> second for whole Jenkins. In long scale this can also block one test job if
>>>> other jobs take all the slots.
>>>> 2. Utilize different users to spin Dataflow jobs.
>>>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>>>> limits the value to 60 requests per minute.
>>>> 4. Long run generic suggestion: limit amount of dataflow jobs we spin
>>>> up and move tests to the form of unit or component tests.
>>>>
>>>> Please, fill in any insights or ideas you have on this.
>>>>
>>>> Regards,
>>>> --Mikhail
>>>>
>>>> Have feedback <http://go/migryz-feedback>?
>>>>
>>>>
>>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mi...@google.com>
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> Seems that we hit quota issue again:
>>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>>
>>>>> Can someone share information on how was this triaged last time or
>>>>> guide me on possible follow-up actions?
>>>>>
>>>>> Regards,
>>>>> --Mikhail
>>>>>
>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>
>>>>>
>>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rf...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Summary for all folks following this story -- and many thanks for
>>>>>> explaining configs to me and pointing me to files and such.
>>>>>>
>>>>>> - Scott made changes to the config and we can now run 3
>>>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>>> concurrent Dataflow jobs when running those
>>>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>>>> have not seen any worrisome limits overall.
>>>>>> - Also note there are improvements planned to the
>>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>>
>>>>>> Cheers,
>>>>>> r
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rf...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Done!
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <
>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>
>>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>>>> Quotas should not be a problem, if they are, please file a JIRA under
>>>>>>>>> gcp-quota.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> r
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>>> your results. Though if all you are sharing is "they passed" then I guess
>>>>>>>>>> we don't have to insist on evidence.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> A few thoughts:
>>>>>>>>>>>
>>>>>>>>>>> * The Jenkins job getting backed up
>>>>>>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>>>>>>> is idle more often than backlogged.
>>>>>>>>>>>
>>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>>>>>>
>>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects of a
>>>>>>>>>>> runner. It would be more efficient to run locally only the tests affected
>>>>>>>>>>> by your change. Note that this requires having access to a GCP project with
>>>>>>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>>>>>>> The command for this is:
>>>>>>>>>>>
>>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>>> [2]
>>>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>>
>>>>>>>>>>>> Each test fork is run on a different gradle worker, so the
>>>>>>>>>>>> number of parallel test runs is limited to the max number of workers
>>>>>>>>>>>> configured which is controlled here:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>>>
>>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner
>>>>>>>>>>>>>> tests. We currently allow only one of these to run at a time, to control
>>>>>>>>>>>>>> usage of Dataflow and of GCE quota. Other types of tests do not suffer from
>>>>>>>>>>>>>> this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow
>>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8 hours end to
>>>>>>>>>>>>>> end to run these tests (about 6 hours for the run to be scheduled). If
>>>>>>>>>>>>>> there was a failure, I would have had to repeat the whole process. In the
>>>>>>>>>>>>>> worst case, this process could have taken me days. While this is not as
>>>>>>>>>>>>>> pressing as some other issues (as most people don't need to run the
>>>>>>>>>>>>>> Dataflow tests on every PR), fixing it would make such changes much easier
>>>>>>>>>>>>>> to manage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he
>>>>>>>>>>>>>>> was waiting for some test to be scheduled and run, and it took 6 hours or
>>>>>>>>>>>>>>> so. I would like to help reduce these wait times by increasing parallelism.
>>>>>>>>>>>>>>> I need help understanding the continuous minimum of what we use. It seems
>>>>>>>>>>>>>>> the following is true:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16
>>>>>>>>>>>>>>>    CPUs each)
>>>>>>>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU
>>>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and
>>>>>>>>>>>>>>>    seem to run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to execute,
>>>>>>>>>>>>>>> with 75% of CPU capacity.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
> Got feedback? go/pabloem-feedback
> <https://goto.google.com/pabloem-feedback>
>

Re: Parallelizing test runs

Posted by Pablo Estrada <pa...@google.com>.
It feels to me like a peak of 60 jobs per minute is pretty high. If I
understand correctly, we run up to 20 dataflow jobs in parallel per test
suite? Or what's the number here?
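
A rough way to sanity-check that number, using only figures quoted further
down in this thread (16 CPUs per Jenkins machine, 3 Gradle workers per core,
3 ValidatesRunner suites allowed at once). The assumption that every test
fork submits a job within the same minute is mine and purely hypothetical:

```python
# Back-of-the-envelope estimate of the worst-case job-creation burst.
# Inputs are numbers quoted in this thread; the "everything starts in the
# same minute" scenario is a hypothetical worst case, not a measurement.

CORES_PER_JENKINS_MACHINE = 16  # "16 CPUs each"
WORKERS_PER_CORE = 3            # "3 * number of CPU cores"
CONCURRENT_SUITES = 3           # "we can now run 3 ... in parallel"
QUOTA_PER_MINUTE = 60           # CreateRequestsPerMinutePerUser default


def peak_create_requests(cores: int, workers_per_core: int, suites: int) -> int:
    """Upper bound on create requests if every test fork starts a job at once."""
    return cores * workers_per_core * suites


jobs_per_suite = peak_create_requests(
    CORES_PER_JENKINS_MACHINE, WORKERS_PER_CORE, 1)
burst = peak_create_requests(
    CORES_PER_JENKINS_MACHINE, WORKERS_PER_CORE, CONCURRENT_SUITES)

print(f"jobs per suite: {jobs_per_suite}")
print(f"worst-case burst: {burst} vs quota {QUOTA_PER_MINUTE}/min")
```

If those inputs are right, one suite alone (48) stays under the quota, while
several suites starting together (144) could exceed it.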

It is also true that most of our tests are simple NeedsRunner tests that
test a couple of elements, so the whole pipeline overhead is on startup. This
may be improved by lumping tests together (though we might lose
debuggability?). Our average number of jobs is, I hope, much smaller than 60
per minute...

With all these considerations, I would lean more towards having a retry
policy as the immediate solution.
-P.

On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <ap...@google.com> wrote:

> I like 1 and 2. How do credentials get into Jenkins? Could we create a
> user per Jenkins host?
>
> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>
>> There was also a proposal to lump multiple tests into a single Dataflow
>> job instead of spinning up a separate Dataflow job for each test.
>>
>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mi...@google.com>
>> wrote:
>>
>>> I synced with Rafael. Below is a summary of the discussion.
>>>
>>> The quota we hit is CreateRequestsPerMinutePerUser, which defaults to 60
>>> requests per minute per user.
>>>
>>> I've created Jira [BEAM-5053](
>>> https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>>
>>> I see the following options:
>>> 1. Add retry logic. This still limits us to an average of 1 Dataflow job
>>> start per second across all of Jenkins, and at scale one test job can be
>>> blocked if other jobs take all the slots.
>>> 2. Use different users to spin up Dataflow jobs.
>>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>>> limits the value to 60 requests per minute.
>>> 4. Long-term, generic suggestion: limit the number of Dataflow jobs we spin
>>> up and move tests to unit or component tests.
>>>
>>> Please add any insights or ideas you have on this.
>>>
>>> Regards,
>>> --Mikhail
>>>
>>> Have feedback <http://go/migryz-feedback>?
>>>
>>>
>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mi...@google.com>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> It seems that we hit a quota issue again:
>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>
>>>> Can someone share information on how this was triaged last time or
>>>> guide me on possible follow-up actions?
>>>>
>>>> Regards,
>>>> --Mikhail
>>>>
>>>> Have feedback <http://go/migryz-feedback>?
>>>>
>>>>
>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rf...@google.com>
>>>> wrote:
>>>>
>>>>> Summary for all folks following this story -- and many thanks for
>>>>> explaining configs to me and pointing me to files and such.
>>>>>
>>>>> - Scott made changes to the config and we can now run 3
>>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>> concurrent Dataflow jobs when running those
>>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>>> have not seen any worrisome limits overall.
>>>>> - Also note there are improvements planned to the
>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>
>>>>> Cheers,
>>>>> r
>>>>>
>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rf...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Done!
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>>>
>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>
>>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <
>>>>>>> rfernand@google.com> wrote:
>>>>>>>
>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>>> Quotas should not be a problem, if they are, please file a JIRA under
>>>>>>>> gcp-quota.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> r
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>> your results. Though if all you are sharing is "they passed" then I guess
>>>>>>>>> we don't have to insist on evidence.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> A few thoughts:
>>>>>>>>>>
>>>>>>>>>> * The Jenkins job getting backed up
>>>>>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>>>>>> is idle more often than backlogged.
>>>>>>>>>>
>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>>>>>
>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects of a
>>>>>>>>>> runner. It would be more efficient to run locally only the tests affected
>>>>>>>>>> by your change. Note that this requires having access to a GCP project with
>>>>>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>>>>>> The command for this is:
>>>>>>>>>>
>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>> [2]
>>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>
>>>>>>>>>>> Each test fork is run on a different gradle worker, so the
>>>>>>>>>>> number of parallel test runs is limited to the max number of workers
>>>>>>>>>>> configured which is controlled here:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>>
>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner
>>>>>>>>>>>>> tests. We currently allow only one of these to run at a time, to control
>>>>>>>>>>>>> usage of Dataflow and of GCE quota. Other types of tests do not suffer from
>>>>>>>>>>>>> this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow
>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8 hours end to
>>>>>>>>>>>>> end to run these tests (about 6 hours for the run to be scheduled). If
>>>>>>>>>>>>> there was a failure, I would have had to repeat the whole process. In the
>>>>>>>>>>>>> worst case, this process could have taken me days. While this is not as
>>>>>>>>>>>>> pressing as some other issues (as most people don't need to run the
>>>>>>>>>>>>> Dataflow tests on every PR), fixing it would make such changes much easier
>>>>>>>>>>>>> to manage.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>>>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>>>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>>>>>>>>> following is true:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16
>>>>>>>>>>>>>>    CPUs each)
>>>>>>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU
>>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem
>>>>>>>>>>>>>>    to run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to execute,
>>>>>>>>>>>>>> with 75% of CPU capacity.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
Got feedback? go/pabloem-feedback

Re: Parallelizing test runs

Posted by Andrew Pilloud <ap...@google.com>.
I like 1 and 2. How do credentials get into Jenkins? Could we create a user
per Jenkins host?
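
To make the "user per host" idea concrete, here is a minimal sketch. The
host names, project ID, and account names are made up, and I don't know how
credentials are actually provisioned on the Jenkins machines. It also assumes
the quota really is enforced per user with no project-wide cap on top:

```python
from typing import Dict, Iterable

PER_USER_QUOTA = 60  # CreateRequestsPerMinutePerUser default, per this thread


def accounts_per_host(hosts: Iterable[str], project: str) -> Dict[str, str]:
    """Map each Jenkins host to its own (hypothetical) service account email."""
    return {
        host: f"jenkins-{host}@{project}.iam.gserviceaccount.com"
        for host in hosts
    }


def aggregate_quota(num_users: int, per_user_quota: int = PER_USER_QUOTA) -> int:
    """Combined create-request budget if the quota is enforced purely per user."""
    return num_users * per_user_quota


hosts = [f"beam{i}" for i in range(1, 17)]  # 16 made-up host names
accounts = accounts_per_host(hosts, "example-project")
print(accounts["beam1"])
print(aggregate_quota(len(accounts)), "create requests per minute in total")
```

With one user per host, a single busy host could still hit its own 60/min
limit, but it would no longer starve jobs running on the other hosts.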

On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:

> There was also a proposal to lump multiple tests into a single Dataflow
> job instead of spinning up a separate Dataflow job for each test.
>
> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mi...@google.com>
> wrote:
>
>> I synced with Rafael. Below is a summary of the discussion.
>>
>> The quota we hit is CreateRequestsPerMinutePerUser, which defaults to 60
>> requests per minute per user.
>>
>> I've created Jira [BEAM-5053](
>> https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>
>> I see the following options:
>> 1. Add retry logic. This still limits us to an average of 1 Dataflow job
>> start per second across all of Jenkins, and at scale one test job can be
>> blocked if other jobs take all the slots.
>> 2. Use different users to spin up Dataflow jobs.
>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>> limits the value to 60 requests per minute.
>> 4. Long-term, generic suggestion: limit the number of Dataflow jobs we spin
>> up and move tests to unit or component tests.
>>
>> Please add any insights or ideas you have on this.
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mi...@google.com>
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> It seems that we hit a quota issue again:
>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>
>>> Can someone share information on how this was triaged last time or guide
>>> me on possible follow-up actions?
>>>
>>> Regards,
>>> --Mikhail
>>>
>>> Have feedback <http://go/migryz-feedback>?
>>>
>>>
>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rf...@google.com>
>>> wrote:
>>>
>>>> Summary for all folks following this story -- and many thanks for
>>>> explaining configs to me and pointing me to files and such.
>>>>
>>>> - Scott made changes to the config and we can now run 3
>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>> concurrent Dataflow jobs when running those
>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>> have not seen any worrisome limits overall.
>>>> - Also note there are improvements planned to the
>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>> itself runs faster -- I believe it's on Alan's radar
>>>>
>>>> Cheers,
>>>> r
>>>>
>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rf...@google.com>
>>>> wrote:
>>>>
>>>>> Done!
>>>>>
>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>>
>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>
>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rf...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>> Quotas should not be a problem, if they are, please file a JIRA under
>>>>>>> gcp-quota.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> r
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> One thing that is nice when you do this is to be able to share your
>>>>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>>>>> don't have to insist on evidence.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> A few thoughts:
>>>>>>>>>
>>>>>>>>> * The Jenkins job getting backed up
>>>>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>>>>> is idle more often than backlogged.
>>>>>>>>>
>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>>>>
>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects of a
>>>>>>>>> runner. It would be more efficient to run locally only the tests affected
>>>>>>>>> by your change. Note that this requires having access to a GCP project with
>>>>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>>>>> The command for this is:
>>>>>>>>>
>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>> [2]
>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>
>>>>>>>>>> Each test fork is run on a different gradle worker, so the number
>>>>>>>>>> of parallel test runs is limited to the max number of workers configured
>>>>>>>>>> which is controlled here:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>
>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests.
>>>>>>>>>>>> We currently allow only one of these to run at a time, to control usage of
>>>>>>>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>>>>>>>> issue.
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow quota
>>>>>>>>>>>> so we can run more of these in parallel. It took me 8 hours end to end to
>>>>>>>>>>>> run these tests (about 6 hours for the run to be scheduled). If there was a
>>>>>>>>>>>> failure, I would have had to repeat the whole process. In the worst case,
>>>>>>>>>>>> this process could have taken me days. While this is not as pressing as
>>>>>>>>>>>> some other issues (as most people don't need to run the Dataflow tests on
>>>>>>>>>>>> every PR), fixing it would make such changes much easier to manage.
>>>>>>>>>>>>
>>>>>>>>>>>> Reuven
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>>>>>>>> following is true:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16 CPUs
>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU
>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem
>>>>>>>>>>>>>    to run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to execute,
>>>>>>>>>>>>> with 75% of CPU capacity.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> r
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: Parallelizing test runs

Posted by Reuven Lax <re...@google.com>.
There was also a proposal to lump multiple tests into a single Dataflow job
instead of spinning up a separate Dataflow job for each test.
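
As a rough illustration of why that helps with the create-request quota,
here is a small sketch. The test names and batch size are placeholders, and
this is not how the ValidatesRunner suite is actually wired up; it only
shows how grouping reduces the number of jobs to create:

```python
from typing import List, Sequence


def batch_tests(tests: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split tests into consecutive groups; each group would share one job."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(tests[i:i + batch_size])
        for i in range(0, len(tests), batch_size)
    ]


tests = [f"Test{i}" for i in range(10)]  # placeholder test names
batches = batch_tests(tests, 4)
print(f"{len(tests)} tests -> {len(batches)} job creations")
```

The trade-off is that a failure in one batch is harder to attribute to a
single test, so the batch size would need some tuning.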

On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mi...@google.com> wrote:

> I synced with Rafael. Below is a summary of the discussion.
>
> The quota we hit is CreateRequestsPerMinutePerUser, which defaults to 60
> requests per minute per user.
>
> I've created Jira [BEAM-5053](
> https://issues.apache.org/jira/browse/BEAM-5053) for this.
>
> I see the following options:
> 1. Add retry logic. This still limits us to an average of 1 Dataflow job
> start per second across all of Jenkins, and at scale one test job can be
> blocked if other jobs take all the slots.
> 2. Use different users to spin up Dataflow jobs.
> 3. Find a way to raise the quota limit on Dataflow. By default the field
> limits the value to 60 requests per minute.
> 4. Long-term, generic suggestion: limit the number of Dataflow jobs we spin
> up and move tests to unit or component tests.
>
> Please add any insights or ideas you have on this.
>
> Regards,
> --Mikhail
>
> Have feedback <http://go/migryz-feedback>?
>
>
> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mi...@google.com>
> wrote:
>
>> Hi Everyone,
>>
>> It seems that we hit a quota issue again:
>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>
>> Can someone share information on how this was triaged last time or guide
>> me on possible follow-up actions?
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rf...@google.com>
>> wrote:
>>
>>> Summary for all folks following this story -- and many thanks for
>>> explaining configs to me and pointing me to files and such.
>>>
>>> - Scott made changes to the config and we can now run 3
>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>> - With the latest quota changes, we peaked at ~70% capacity in
>>> concurrent Dataflow jobs when running those
>>> - I've been keeping an eye on quota peaks for all resources today and
>>> have not seen any worrisome limits overall.
>>> - Also note there are improvements planned to the
>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>> itself runs faster -- I believe it's on Alan's radar
>>>
>>> Cheers,
>>> r
>>>
>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rf...@google.com>
>>> wrote:
>>>
>>>> Done!
>>>>
>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>
>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>> Can you take a look? I've filed [BEAM-4722]:
>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>
>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>
>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rf...@google.com>
>>>>> wrote:
>>>>>
>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>> Quotas should not be a problem, if they are, please file a JIRA under
>>>>>> gcp-quota.
>>>>>>
>>>>>> Cheers,
>>>>>> r
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> One thing that is nice when you do this is to be able to share your
>>>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>>>> don't have to insist on evidence.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> A few thoughts:
>>>>>>>>
>>>>>>>> * The Jenkins job getting backed up
>>>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>>>> is idle more often than backlogged.
>>>>>>>>
>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>>>
>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>> validation of Dataflow runner, and tests a lot of different aspects of a
>>>>>>>> runner. It would be more efficient to run locally only the tests affected
>>>>>>>> by your change. Note that this requires having access to a GCP project with
>>>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>>>> The command for this is:
>>>>>>>>
>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>> [2]
>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>>> currently set to be "unlimited":
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>
>>>>>>>>> Each test fork is run on a different gradle worker, so the number
>>>>>>>>> of parallel test runs is limited to the max number of workers configured
>>>>>>>>> which is controlled here:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>
>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>> - Where are those settings?
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests.
>>>>>>>>>>> We currently allow only one of these to run at a time, to control usage of
>>>>>>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>>>>>>> issue.
>>>>>>>>>>>
>>>>>>>>>>> I would like to see if it's possible to increase Dataflow quota
>>>>>>>>>>> so we can run more of these in parallel. It took me 8 hours end to end to
>>>>>>>>>>> run these tests (about 6 hours for the run to be scheduled). If there was a
>>>>>>>>>>> failure, I would have had to repeat the whole process. In the worst case,
>>>>>>>>>>> this process could have taken me days. While this is not as pressing as
>>>>>>>>>>> some other issues (as most people don't need to run the Dataflow tests on
>>>>>>>>>>> every PR), fixing it would make such changes much easier to manage.
>>>>>>>>>>>
>>>>>>>>>>> Reuven
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>>>>>>> following is true:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16 CPUs
>>>>>>>>>>>>    each)
>>>>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU
>>>>>>>>>>>>    each)
>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem
>>>>>>>>>>>>    to run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>>
>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to execute,
>>>>>>>>>>>> with 75% of CPU capacity.
>>>>>>>>>>>>
>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> r
>>>>>>>>>>>>
>>>>>>>>>>>

Re: Parallelizing test runs

Posted by Mikhail Gryzykhin <mi...@google.com>.
I synced with Rafael. Below is a summary of the discussion.

The quota we hit is CreateRequestsPerMinutePerUser, which defaults to 60
requests per minute per user.

I've created Jira [BEAM-5053](
https://issues.apache.org/jira/browse/BEAM-5053) for this.

I see the following options:
1. Add retry logic. This still limits us to an average of 1 Dataflow job
start per second across all of Jenkins, and at scale one test job can be
blocked if other jobs take all the slots.
2. Use different users to spin up Dataflow jobs.
3. Find a way to raise the quota limit on Dataflow. By default the field
limits the value to 60 requests per minute.
4. Long-term, generic suggestion: limit the number of Dataflow jobs we spin
up and move tests to unit or component tests.

Please add any insights or ideas you have on this.
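
For option 1, a minimal sketch of what retry logic could look like, assuming
exponential backoff with jitter. `QuotaExceededError` and `submit_with_retry`
are hypothetical names for illustration; I have not checked which exception
the Dataflow client actually raises when the quota is hit:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the error raised when the quota is hit."""


def submit_with_retry(
    submit: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call submit(), retrying quota errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except QuotaExceededError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the quota error
            # Double the wait on each failure, capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Random jitter keeps parallel test forks from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")
```

A caller would wrap whatever function launches the job, for example
`submit_with_retry(lambda: launch_job(options))`, where `launch_job` is
again a placeholder. Retries smooth out bursts but do not raise the overall
ceiling of 60 job starts per minute.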

Regards,
--Mikhail

Have feedback <http://go/migryz-feedback>?


On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mi...@google.com> wrote:

> Hi Everyone,
>
> It seems that we hit a quota issue again:
> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>
> Can someone share information on how this was triaged last time or guide
> me on possible follow-up actions?
>
> Regards,
> --Mikhail
>
> Have feedback <http://go/migryz-feedback>?
>
>
> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rf...@google.com>
> wrote:
>
>> Summary for all folks following this story -- and many thanks for
>> explaining configs to me and pointing me to files and such.
>>
>> - Scott made changes to the config and we can now run 3
>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>> - With the latest quota changes, we peaked at ~70% capacity in concurrent
>> Dataflow jobs when running those
>> - I've been keeping an eye on quota peaks for all resources today and
>> have not seen any worrisome limits overall.
>> - Also note there are improvements planned to the
>> ValidatesRunner.Dataflow test so various items get batched and the test
>> itself runs faster -- I believe it's on Alan's radar
>>
>> Cheers,
>> r
>>
>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rf...@google.com>
>> wrote:
>>
>>> Done!
>>>
>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>
>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can
>>>> you take a look? I've filed [BEAM-4722]:
>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>
>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>
>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rf...@google.com>
>>>> wrote:
>>>>
>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
>>>>> should not be a problem, if they are, please file a JIRA under gcp-quota.
>>>>>
>>>>> Cheers,
>>>>> r
>>>>>
>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com>
>>>>> wrote:
>>>>>
>>>>>> One thing that is nice when you do this is to be able to share your
>>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>>> don't have to insist on evidence.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>>>>>
>>>>>>> A few thoughts:
>>>>>>>
>>>>>>> * The Jenkins job getting backed up
>>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>>> is idle more often than backlogged.
>>>>>>>
>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>>
>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>> validation of Dataflow runner, and tests a lot of different aspects of a
>>>>>>> runner. It would be more efficient to run locally only the tests affected
>>>>>>> by your change. Note that this requires having access to a GCP project with
>>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>>> The command for this is:
>>>>>>>
>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>
>>>>>>> [1]
>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>> [2]
>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>>
>>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>>> currently set to be "unlimited":
>>>>>>>>
>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>
>>>>>>>> Each test fork is run on a different gradle worker, so the number
>>>>>>>> of parallel test runs is limited to the max number of workers configured
>>>>>>>> which is controlled here:
>>>>>>>>
>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>
>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>
>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>> - Where are those settings?
>>>>>>>>>
>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests.
>>>>>>>>>> We currently allow only one of these to run at a time, to control usage of
>>>>>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>>>>>> issue.
>>>>>>>>>>
>>>>>>>>>> I would like to see if it's possible to increase Dataflow quota
>>>>>>>>>> so we can run more of these in parallel. It took me 8 hours end to end to
>>>>>>>>>> run these tests (about 6 hours for the run to be scheduled). If there was a
>>>>>>>>>> failure, I would have had to repeat the whole process. In the worst case,
>>>>>>>>>> this process could have taken me days. While this is not as pressing as
>>>>>>>>>> some other issues (as most people don't need to run the Dataflow tests on
>>>>>>>>>> every PR), fixing it would make such changes much easier to manage.
>>>>>>>>>>
>>>>>>>>>> Reuven
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>>>>>> following is true:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16 CPUs
>>>>>>>>>>>    each)
>>>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU each)
>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem to
>>>>>>>>>>>    run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>
>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to execute,
>>>>>>>>>>> with 75% of CPU capacity.
>>>>>>>>>>>
>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> r
>>>>>>>>>>>
>>>>>>>>>>

Re: Parallelizing test runs

Posted by Mikhail Gryzykhin <mi...@google.com>.
Hi Everyone,

It seems we hit a quota issue again:
https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull

Can someone share how this was triaged last time, or point me to possible
follow-up actions?

Regards,
--Mikhail

Have feedback <http://go/migryz-feedback>?


On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rf...@google.com> wrote:

> Summary for all folks following this story -- and many thanks for
> explaining configs to me and pointing me to files and such.
>
> - Scott made changes to the config and we can now run 3
> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
> - With the latest quota changes, we peaked at ~70% capacity in concurrent
> Dataflow jobs when running those
> - I've been keeping an eye on quota peaks for all resources today and have
> not seen any worrisome limits overall.
> - Also note there are improvements planned to the ValidatesRunner.Dataflow
> test so various items get batched and the test itself runs faster -- I
> believe it's on Alan's radar
>
> Cheers,
> r
>
> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rf...@google.com>
> wrote:
>
>> Done!
>>
>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>
>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can
>>> you take a look? I've filed [BEAM-4722]:
>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>
>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>
>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rf...@google.com>
>>> wrote:
>>>
>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
>>>> should not be a problem, if they are, please file a JIRA under gcp-quota.
>>>>
>>>> Cheers,
>>>> r
>>>>
>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com> wrote:
>>>>
>>>>> One thing that is nice when you do this is to be able to share your
>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>> don't have to insist on evidence.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>>>>
>>>>>> A few thoughts:
>>>>>>
>>>>>> * The Jenkins job getting backed up
>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>> is idle more often than backlogged.
>>>>>>
>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>
>>>>>> * This test suite is meant to be an exhaustive post-commit validation
>>>>>> of Dataflow runner, and tests a lot of different aspects of a runner. It
>>>>>> would be more efficient to run locally only the tests affected by your
>>>>>> change. Note that this requires having access to a GCP project with
>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>> The command for this is:
>>>>>>
>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>
>>>>>> [1]
>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>> [2]
>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>
>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>> currently set to be "unlimited":
>>>>>>>
>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>
>>>>>>> Each test fork is run on a different gradle worker, so the number of
>>>>>>> parallel test runs is limited to the max number of workers configured which
>>>>>>> is controlled here:
>>>>>>>
>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>
>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>> rfernand@google.com> wrote:
>>>>>>>
>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>> - Where are those settings?
>>>>>>>>
>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>>>>>>>> currently allow only one of these to run at a time, to control usage of
>>>>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> I would like to see if it's possible to increase Dataflow quota so
>>>>>>>>> we can run more of these in parallel. It took me 8 hours end to end to run
>>>>>>>>> these tests (about 6 hours for the run to be scheduled). If there was a
>>>>>>>>> failure, I would have had to repeat the whole process. In the worst case,
>>>>>>>>> this process could have taken me days. While this is not as pressing as
>>>>>>>>> some other issues (as most people don't need to run the Dataflow tests on
>>>>>>>>> every PR), fixing it would make such changes much easier to manage.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>>>>> following is true:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - There seems to always be 16 jenkins machines on (16 CPUs
>>>>>>>>>>    each)
>>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU each)
>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem to
>>>>>>>>>>    run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>
>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to execute,
>>>>>>>>>> with 75% of CPU capacity.
>>>>>>>>>>
>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> r
>>>>>>>>>>
>>>>>>>>>

Re: Parallelizing test runs

Posted by Rafael Fernandez <rf...@google.com>.
Summary for all folks following this story -- and many thanks for
explaining configs to me and pointing me to files and such.

- Scott made changes to the config and we can now run 3
ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
- With the latest quota changes, we peaked at ~70% capacity in concurrent
Dataflow jobs when running those
- I've been keeping an eye on quota peaks for all resources today and have
not seen any worrisome limits overall.
- Also note there are improvements planned to the ValidatesRunner.Dataflow
test so various items get batched and the test itself runs faster -- I
believe it's on Alan's radar

Cheers,
r

On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rf...@google.com> wrote:

> Done!
>
> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>
>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can
>> you take a look? I've filed [BEAM-4722]:
>> https://issues.apache.org/jira/browse/BEAM-4722
>>
>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>
>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rf...@google.com>
>> wrote:
>>
>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
>>> should not be a problem, if they are, please file a JIRA under gcp-quota.
>>>
>>> Cheers,
>>> r
>>>
>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> One thing that is nice when you do this is to be able to share your
>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>> don't have to insist on evidence.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>>>
>>>>> A few thoughts:
>>>>>
>>>>> * The Jenkins job getting backed up
>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>> is idle more often than backlogged.
>>>>>
>>>>> * It's difficult to reason about our exact quota needs because
>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>> execution of this job and increase our quota enough to give some breathing
>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>
>>>>> * This test suite is meant to be an exhaustive post-commit validation
>>>>> of Dataflow runner, and tests a lot of different aspects of a runner. It
>>>>> would be more efficient to run locally only the tests affected by your
>>>>> change. Note that this requires having access to a GCP project with
>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>> The command for this is:
>>>>>
>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>
>>>>> [1]
>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>> [2]
>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>
>>>>>
>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> The validates runner test parallelism is controlled here and is
>>>>>> currently set to be "unlimited":
>>>>>>
>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>
>>>>>> Each test fork is run on a different gradle worker, so the number of
>>>>>> parallel test runs is limited to the max number of workers configured which
>>>>>> is controlled here:
>>>>>>
>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>
>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>
>>>>>>
>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rf...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>> - Where are those settings?
>>>>>>>
>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:
>>>>>>>
>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>>>>>>> currently allow only one of these to run at a time, to control usage of
>>>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> I would like to see if it's possible to increase Dataflow quota so
>>>>>>>> we can run more of these in parallel. It took me 8 hours end to end to run
>>>>>>>> these tests (about 6 hours for the run to be scheduled). If there was a
>>>>>>>> failure, I would have had to repeat the whole process. In the worst case,
>>>>>>>> this process could have taken me days. While this is not as pressing as
>>>>>>>> some other issues (as most people don't need to run the Dataflow tests on
>>>>>>>> every PR), fixing it would make such changes much easier to manage.
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>> rfernand@google.com> wrote:
>>>>>>>>
>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>>>> following is true:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - There seems to always be 16 jenkins machines on (16 CPUs
>>>>>>>>>    each)
>>>>>>>>>    - There seems to be three GKE machines always on (1 CPU each)
>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem to
>>>>>>>>>    run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>
>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>> tests, we still have room for 80 other concurrent dataflow jobs to execute,
>>>>>>>>> with 75% of CPU capacity.
>>>>>>>>>
>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> r
>>>>>>>>>
>>>>>>>>

Re: Parallelizing test runs

Posted by Rafael Fernandez <rf...@google.com>.
Done!

On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:

> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can
> you take a look? I've filed [BEAM-4722]:
> https://issues.apache.org/jira/browse/BEAM-4722
>
> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>
> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rf...@google.com>
> wrote:
>
>> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
>> should not be a problem, if they are, please file a JIRA under gcp-quota.
>>
>> Cheers,
>> r
>>
>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> One thing that is nice when you do this is to be able to share your
>>> results. Though if all you are sharing is "they passed" then I guess we
>>> don't have to insist on evidence.
>>>
>>> Kenn
>>>
>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>>
>>>> A few thoughts:
>>>>
>>>> * The Jenkins job getting backed up
>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>> is idle more often than backlogged.
>>>>
>>>> * It's difficult to reason about our exact quota needs because Dataflow
>>>> jobs get launched from various Jenkins jobs that have different parallelism
>>>> configurations. If we have budget, we could enable concurrent execution of
>>>> this job and increase our quota enough to give some breathing room. If we
>>>> do this, I recommend limiting the max concurrency via
>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>
>>>> * This test suite is meant to be an exhaustive post-commit validation
>>>> of Dataflow runner, and tests a lot of different aspects of a runner. It
>>>> would be more efficient to run locally only the tests affected by your
>>>> change. Note that this requires having access to a GCP project with
>>>> billing, but most Dataflow developers probably have access to this already.
>>>> The command for this is:
>>>>
>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>>> --tests "org.apache.beam.MyTestClass"
>>>>
>>>> [1]
>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>> [2]
>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>
>>>>
>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> The validates runner test parallelism is controlled here and is
>>>>> currently set to be "unlimited":
>>>>>
>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>
>>>>> Each test fork is run on a different gradle worker, so the number of
>>>>> parallel test runs is limited to the max number of workers configured which
>>>>> is controlled here:
>>>>>
>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>
>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>
>>>>>
>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rf...@google.com>
>>>>> wrote:
>>>>>
>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>> - Where are those settings?
>>>>>>
>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>>>>>> currently allow only one of these to run at a time, to control usage of
>>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>>> issue.
>>>>>>>
>>>>>>> I would like to see if it's possible to increase Dataflow quota so
>>>>>>> we can run more of these in parallel. It took me 8 hours end to end to run
>>>>>>> these tests (about 6 hours for the run to be scheduled). If there was a
>>>>>>> failure, I would have had to repeat the whole process. In the worst case,
>>>>>>> this process could have taken me days. While this is not as pressing as
>>>>>>> some other issues (as most people don't need to run the Dataflow tests on
>>>>>>> every PR), fixing it would make such changes much easier to manage.
>>>>>>>
>>>>>>> Reuven
>>>>>>>
>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>> rfernand@google.com> wrote:
>>>>>>>
>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>>> following is true:
>>>>>>>>
>>>>>>>>
>>>>>>>>    - There seems to always be 16 jenkins machines on (16 CPUs each)
>>>>>>>>    - There seems to be three GKE machines always on (1 CPU each)
>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem to
>>>>>>>>    run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>
>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit tests,
>>>>>>>> we still have room for 80 other concurrent dataflow jobs to execute, with
>>>>>>>> 75% of CPU capacity.
>>>>>>>>
>>>>>>>> Thoughts? Additional data?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> r
>>>>>>>>
>>>>>>>

Re: Parallelizing test runs

Posted by Scott Wegner <sc...@apache.org>.
Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can you
take a look? I've filed [BEAM-4722]:
https://issues.apache.org/jira/browse/BEAM-4722

[1] https://github.com/apache/beam/pull/5861#issuecomment-401963630

On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rf...@google.com>
wrote:

> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
> should not be a problem, if they are, please file a JIRA under gcp-quota.
>
> Cheers,
> r
>
> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com> wrote:
>
>> One thing that is nice when you do this is to be able to share your
>> results. Though if all you are sharing is "they passed" then I guess we
>> don't have to insist on evidence.
>>
>> Kenn
>>
>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>
>>> A few thoughts:
>>>
>>> * The Jenkins job getting backed up
>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>> is idle more often than backlogged.
>>>
>>> * It's difficult to reason about our exact quota needs because Dataflow
>>> jobs get launched from various Jenkins jobs that have different parallelism
>>> configurations. If we have budget, we could enable concurrent execution of
>>> this job and increase our quota enough to give some breathing room. If we
>>> do this, I recommend limiting the max concurrency via
>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>
>>> * This test suite is meant to be an exhaustive post-commit validation of
>>> Dataflow runner, and tests a lot of different aspects of a runner. It would
>>> be more efficient to run locally only the tests affected by your change.
>>> Note that this requires having access to a GCP project with billing, but
>>> most Dataflow developers probably have access to this already. The command
>>> for this is:
>>>
>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
>>> --tests "org.apache.beam.MyTestClass"
>>>
>>> [1]
>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>> [2]
>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>
>>>
>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> The validates runner test parallelism is controlled here and is
>>>> currently set to be "unlimited":
>>>>
>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>
>>>> Each test fork is run on a different gradle worker, so the number of
>>>> parallel test runs is limited to the max number of workers configured which
>>>> is controlled here:
>>>>
>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>> It is currently configured to 3 * number of CPU cores.
>>>>
>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>
>>>>
>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rf...@google.com>
>>>> wrote:
>>>>
>>>>> - How many resources do ValidatesRunner tests use?
>>>>> - Where are those settings?
>>>>>
>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>>>>> currently allow only one of these to run at a time, to control usage of
>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>> issue.
>>>>>>
>>>>>> I would like to see if it's possible to increase Dataflow quota so we
>>>>>> can run more of these in parallel. It took me 8 hours end to end to run
>>>>>> these tests (about 6 hours for the run to be scheduled). If there was a
>>>>>> failure, I would have had to repeat the whole process. In the worst case,
>>>>>> this process could have taken me days. While this is not as pressing as
>>>>>> some other issues (as most people don't need to run the Dataflow tests on
>>>>>> every PR), fixing it would make such changes much easier to manage.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <rf...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so. I
>>>>>>> would like to help reduce these wait times by increasing parallelism. I
>>>>>>> need help understanding the continuous minimum of what we use. It seems the
>>>>>>> following is true:
>>>>>>>
>>>>>>>
>>>>>>>    - There seems to always be 16 jenkins machines on (16 CPUs each)
>>>>>>>    - There seems to be three GKE machines always on (1 CPU each)
>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem to run
>>>>>>>    one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>
>>>>>>> With current quotas, if we parallelize to 20 concurrent unit tests,
>>>>>>> we still have room for 80 other concurrent dataflow jobs to execute, with
>>>>>>> 75% of CPU capacity.
>>>>>>>
>>>>>>> Thoughts? Additional data?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> r
>>>>>>>
>>>>>>

Re: Parallelizing test runs

Posted by Rafael Fernandez <rf...@google.com>.
OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
should not be a problem; if they are, please file a JIRA under gcp-quota.

Cheers,
r

On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <kl...@google.com> wrote:

> One thing that is nice when you do this is to be able to share your
> results. Though if all you are sharing is "they passed" then I guess we
> don't have to insist on evidence.
>
> Kenn

Re: Parallelizing test runs

Posted by Kenneth Knowles <kl...@google.com>.
One thing that is nice when you do this is to be able to share your
results. Though if all you are sharing is "they passed" then I guess we
don't have to insist on evidence.

Kenn

On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:

> A few thoughts:
>
> * The Jenkins job getting backed up
> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
> is idle more often than backlogged.
>
> * It's difficult to reason about our exact quota needs because Dataflow
> jobs get launched from various Jenkins jobs that have different parallelism
> configurations. If we have budget, we could enable concurrent execution of
> this job and increase our quota enough to give some breathing room. If we
> do this, I recommend limiting the max concurrency via
> throttleConcurrentBuilds [2] to some reasonable limit.
>
> * This test suite is meant to be an exhaustive post-commit validation of
> Dataflow runner, and tests a lot of different aspects of a runner. It would
> be more efficient to run locally only the tests affected by your change.
> Note that this requires having access to a GCP project with billing, but
> most Dataflow developers probably have access to this already. The command
> for this is:
>
> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
> -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot
> --tests "org.apache.beam.MyTestClass"
>
> [1]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
> [2]
> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds

Re: Parallelizing test runs

Posted by Scott Wegner <sc...@apache.org>.
A few thoughts:

* The Jenkins job getting backed up
is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
Mikhail refactored Jenkins jobs, this only runs when explicitly requested
via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
is idle more often than backlogged.

* It's difficult to reason about our exact quota needs because Dataflow
jobs get launched from various Jenkins jobs that have different parallelism
configurations. If we have budget, we could enable concurrent execution of
this job and increase our quota enough to give some breathing room. If we
do this, I recommend limiting the max concurrency via
throttleConcurrentBuilds [2] to some reasonable limit.

* This test suite is meant to be an exhaustive post-commit validation of
the Dataflow runner, and it tests many different aspects of a runner. It
would be more efficient to run locally only the tests affected by your change.
Note that this requires having access to a GCP project with billing, but
most Dataflow developers probably have access to this already. The command
for this is:

./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner \
  -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot \
  --tests "org.apache.beam.MyTestClass"
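
A hedged sketch of how the command above might be wrapped to select several
test classes at once. The function name build_validates_runner_cmd is
hypothetical, the project, bucket, and class names are placeholders, and
repeating --tests once per class is assumed to behave here as it does for
standard Gradle Test tasks. The function only prints the command, so it can
be reviewed before anything runs against a billed GCP project.

```shell
# Sketch only: assembles the validatesRunner invocation for selected classes.
# All argument values below are placeholders, not real project settings.
build_validates_runner_cmd() {
  project="$1"     # GCP project with billing enabled (placeholder)
  temp_root="$2"   # GCS temp location (placeholder)
  shift 2
  cmd="./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner"
  cmd="$cmd -PdataflowProject=$project -PdataflowTempRoot=$temp_root"
  # One --tests filter per remaining argument (assumed to be repeatable).
  for test_class in "$@"; do
    cmd="$cmd --tests \"$test_class\""
  done
  printf '%s\n' "$cmd"
}

build_validates_runner_cmd myGcpProject gs://myGcsTempRoot \
  org.apache.beam.MyTestClass org.apache.beam.MyOtherTestClass
```

Printing rather than executing is a deliberate choice in this sketch, since
these tests launch Dataflow jobs; the printed line can be pasted into a shell
once it looks right.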

[1]
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
[2]
https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds


On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:

> The validates runner test parallelism is controlled here and is currently
> set to be "unlimited":
>
> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>
> Each test fork is run on a different gradle worker, so the number of
> parallel test runs is limited to the max number of workers configured which
> is controlled here:
>
> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
> It is currently configured to 3 * number of CPU cores.
>
> We are already running up to 48 Dataflow jobs in parallel.

Re: Parallelizing test runs

Posted by Lukasz Cwik <lc...@google.com>.
ValidatesRunner test parallelism is controlled here and is currently set
to "unlimited":
https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115

Each test fork runs on a different Gradle worker, so the number of
parallel test runs is limited by the configured maximum number of workers,
which is controlled here:
https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
It is currently set to 3 * the number of CPU cores.

We are already running up to 48 Dataflow jobs in parallel.
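
The arithmetic behind the 48 figure can be sketched as follows, assuming it
comes from the 3x multiplier applied to the 16-CPU Jenkins machines mentioned
at the start of this thread. That pairing is an inference, not something
stated above, and the function name max_workers is hypothetical.

```shell
# Sketch: worker limit = multiplier * CPU cores.
# The default multiplier of 3 is the setting described above; the 16-core
# input is an assumption based on the Jenkins machine size in this thread.
max_workers() {
  cores="$1"
  multiplier="${2:-3}"
  echo $((cores * multiplier))
}

max_workers 16   # prints 48
```

On a Linux agent, max_workers "$(nproc)" would give the corresponding value
for that machine.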


On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rf...@google.com>
wrote:

> - How many resources do ValidatesRunner tests use?
> - Where are those settings?

Re: Parallelizing test runs

Posted by Rafael Fernandez <rf...@google.com>.
- How many resources do ValidatesRunner tests use?
- Where are those settings?

On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:

> The specific issue only affects Dataflow ValidatesRunner tests. We
> currently allow only one of these to run at a time, to control usage of
> Dataflow and of GCE quota. Other types of tests do not suffer from this
> issue.
>
> I would like to see if it's possible to increase Dataflow quota so we can
> run more of these in parallel. It took me 8 hours end to end to run these
> tests (about 6 hours for the run to be scheduled). If there was a failure,
> I would have had to repeat the whole process. In the worst case, this
> process could have taken me days. While this is not as pressing as some
> other issues (as most people don't need to run the Dataflow tests on every
> PR), fixing it would make such changes much easier to manage.
>
> Reuven

Re: Parallelizing test runs

Posted by Reuven Lax <re...@google.com>.
The specific issue only affects Dataflow ValidatesRunner tests. We
currently allow only one of these to run at a time, to control usage of
Dataflow and of GCE quota. Other types of tests do not suffer from this
issue.

I would like to see if it's possible to increase Dataflow quota so we can
run more of these in parallel. It took me 8 hours end to end to run these
tests (about 6 hours for the run to be scheduled). If there was a failure,
I would have had to repeat the whole process. In the worst case, this
process could have taken me days. While this is not as pressing as some
other issues (as most people don't need to run the Dataflow tests on every
PR), fixing it would make such changes much easier to manage.

Reuven

On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <rf...@google.com>
wrote:

> +Reuven Lax <re...@google.com> told me yesterday that he was waiting for
> some test to be scheduled and run, and it took 6 hours or so. I would like
> to help reduce these wait times by increasing parallelism. I need help
> understanding the continuous minimum of what we use. It seems the following
> is true:
>
>
>    - There always seem to be 16 Jenkins machines on (16 CPUs each)
>    - There seem to be three GKE machines always on (1 CPU each)
>    - Most (if not all) unit tests run on 1 machine, and seem to run
>    one-at-a-time <-- I think we can safely parallelize this to 20.
>
> With current quotas, if we parallelize to 20 concurrent unit tests, we
> still have room for 80 other concurrent dataflow jobs to execute, with 75%
> of CPU capacity.
>
> Thoughts? Additional data?
>
> Thanks,
> r
>