Posted to dev@beam.apache.org by Łukasz Gajowy <lg...@apache.org> on 2019/01/14 16:11:12 UTC

Dealing with expensive jenkins + Dataflow jobs

Hi all,

One problem we need to solve while working on the load tests we are
currently developing is that we don't really know how many GCP/Jenkins
resources we can occupy. We did some initial testing with
beam_Java_LoadTests_GroupByKey_Dataflow_Small [1] and it seems that for:

- 1 000 000 000 synthetic records (~23 GB)
- 10 fanouts
- 10 Dataflow workers (--maxNumWorkers)

the total job time exceeds 4 hours. That seems like too much for such a
small load test. Additionally, we plan to add much bigger tests for other
core operations too. The proposal [2] describes only a few of them.
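
For reference, here is a minimal sketch of the shape of such a pipeline.
This is not the actual load test code: GenerateSequence stands in for the
synthetic source, the key count is illustrative, and fanout is assumed to
mean N parallel GroupByKey branches over the same input:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.GenerateSequence;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class GbkLoadTestSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // 10^9 records keyed into 1000 distinct keys (key count is made up).
        PCollection<KV<Long, Long>> input =
            p.apply(GenerateSequence.from(0).to(1_000_000_000L))
                .apply(MapElements
                    .into(TypeDescriptors.kvs(TypeDescriptors.longs(), TypeDescriptors.longs()))
                    .via((Long i) -> KV.of(i % 1000L, i)));

        // Fanout: N parallel GroupByKey branches over the same input, so
        // ~23 GB of input becomes ~230 GB of shuffled data for fanout 10.
        int fanout = 10;
        for (int i = 0; i < fanout; i++) {
          input.apply("GroupByKey-" + i, GroupByKey.<Long, Long>create());
        }

        p.run().waitUntilFinish();
      }
    }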

The questions are:
1. How many workers can we assign to this job without starving the other
jobs? Are 32 workers for a single Dataflow job fine? Would 64 workers for
such a job be fine too?
2. Given that we are going to add more and more load tests soon, do you
think it is a good idea to create a separate GCP project + separate
Jenkins workers for load testing purposes only? This would avoid starvation
of critical tests (post-commits, pre-commits, etc.). Or maybe there is
another solution that would bring such isolation? Is such isolation needed?

Regarding question 2: please note that we will also need to host Flink/Spark
clusters later on GKE/Dataproc (not decided yet).

[1]
https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_Java_LoadTests_GroupByKey_Dataflow_Small_PR/
[2] https://s.apache.org/load-test-basic-operations


Thanks,
Łukasz

Re: Dealing with expensive jenkins + Dataflow jobs

Posted by Łukasz Gajowy <lg...@apache.org>.
Thanks for your suggestions. It's always good to reach out to the dev list!
You're right that we should focus more on what we are trying to test rather
than on generating huge loads.

To stay transparent for everyone:

"what is it we're trying to test?"

I talked with some testing experts from the Dataflow team and did some
experiments. Then I improved the proposal doc to better explain the goal
(previously it was not clear at all).
At the end of the doc [1] you can find a table with proposed test suites
for GroupByKey. I scaled the test scenarios down drastically but made sure
that we're still testing what we want to test. Thanks to that, they don't
use lots of resources but still do the job. Feel free to comment there,
especially if you see any shortcomings we should rethink.

As for other operations (CoGBK, ParDo, SideInput, Combine): those are yet
to come.

> (As an aside, 4 hours x 10 workers seems like a lot for 23GB of
> data...or is it 230GB once you've fanned out?)

It was 230 GB total (23 GB fanned out 10 times) - way too much given what
we want to check. At 10 workers for 4 hours, that's ~40 worker-hours for
230 GB, i.e. under 2 MB/s per worker.

[1]
https://docs.google.com/document/d/1PuIQv4v06eosKKwT76u7S6IP88AnXhTf870Rcj1AHt4/edit#heading=h.n2c0qqzfjcgz

Thanks,
Łukasz


Re: Dealing with expensive jenkins + Dataflow jobs

Posted by Alan Myrvold <am...@google.com>.
Agreeing with Robert about "what is it we're trying to test?". Would a
smaller performance test find the same issues, faster and more reliably?

We have seen issues with the apache-beam-testing project exceeding quota
during Dataflow jobs, resulting in spurious failures during precommits and
postcommits. 32 workers per Dataflow job sounds fine, provided there are
not too many concurrent Dataflow jobs. Not all the tests limit the number
of workers, so I've seen some with ~80 workers. For non-performance tests,
we should be able to drastically limit the number of workers, which would
provide more room for performance tests.
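
As a hedged sketch of that last point (setNumWorkers/setMaxNumWorkers are
the real DataflowPipelineOptions setters; the cap values are made up):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class CappedTestOptions {
      public static DataflowPipelineOptions create(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        // Keep non-performance suites small so autoscaling cannot grab
        // ~80 workers from the shared apache-beam-testing quota.
        options.setNumWorkers(1);     // start with a single worker (illustrative)
        options.setMaxNumWorkers(4);  // hard cap for correctness tests (illustrative)
        return options;
      }
    }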


Re: Dealing with expensive jenkins + Dataflow jobs

Posted by Robert Bradshaw <ro...@google.com>.
I like the idea of creating separate project(s) for load tests so as
to not compete with other tests and the standard development cycle.

As for how many workers is too many, I would take the tack "what is
it we're trying to test?" Unless you're stress-testing the shuffle
itself, much of what Beam does is linearly parallelizable with the
number of machines. Of course one will still want to run over real,
large data sets, but not every load test needs this every time. More
interesting could be to try running at 2x and 4x the data, with 2x
and 4x the machines, and seeing where we fail to be linear.
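
To make "where we fail to be linear" concrete, here is a minimal
weak-scaling check (the efficiency formula is standard; the runtimes
are made-up placeholders):

    // Weak scaling: data and workers grow together, so the ideal runtime
    // stays flat. Efficiency = T(1x) / T(kx): 1.0 is perfectly linear,
    // lower means shuffle (or another non-parallel stage) is the bottleneck.
    public class WeakScalingCheck {
      public static void main(String[] args) {
        double baselineMinutes = 30.0; // 1x data, 1x workers (made-up)
        double scaled2xMinutes = 33.0; // 2x data, 2x workers (made-up)
        double scaled4xMinutes = 41.0; // 4x data, 4x workers (made-up)
        System.out.printf("2x efficiency: %.2f%n", baselineMinutes / scaled2xMinutes);
        System.out.printf("4x efficiency: %.2f%n", baselineMinutes / scaled4xMinutes);
      }
    }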

(As an aside, 4 hours x 10 workers seems like a lot for 23GB of
data...or is it 230GB once you've fanned out?)

Re: Dealing with expensive jenkins + Dataflow jobs

Posted by Łukasz Gajowy <lg...@apache.org>.
Hi,

pinging this thread (maybe some folks missed it). What do you think about
those concerns/ideas?

Łukasz
