Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2021/03/15 21:38:00 UTC

[jira] [Updated] (BEAM-11431) Automated Release Performance Benchmark Regression/Improvement comparison

     [ https://issues.apache.org/jira/browse/BEAM-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-11431:
-----------------------------------
    Priority: P2  (was: P1)

> Automated Release Performance Benchmark Regression/Improvement comparison
> -------------------------------------------------------------------------
>
>                 Key: BEAM-11431
>                 URL: https://issues.apache.org/jira/browse/BEAM-11431
>             Project: Beam
>          Issue Type: Improvement
>          Components: testing
>            Reporter: Robert Burke
>            Priority: P2
>
> The release process includes a step in which the release manager checks for performance regressions: [https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions]
> However, all we're able to check are the measured graphs over time. We have no clear indication of what these metrics were for the last release; we can only see vague trends in the line graphs. To be clear, the line graph is excellent for spotting large sudden changes, or small changes accumulating over a long period, but it doesn't help the release manager very much.
> For one, infrastructure might have changed in the meantime, such as compilers, test machine hardware, and background load, along with the benchmarking code itself, which makes comparing any two points in those graphs very difficult. Worse, each point is only a single run, which puts it at the mercy of variance. That makes it hard to establish that a change is an improvement across the board.
> This Jira proposes that we make it possible to reproducibly performance-test and compare two releases. In addition, we should be able to publish our benchmark results, together with the comparison to the previous release, as part of the release artifacts.
> Obvious caveat: if there are new tests that can't run on the previous release (or old tests that can't run on the new release), they're free to be excluded. This can be automatic, by tagging the tests somehow, or handled by publishing explicit manual exclusions or inclusions; a sketch of the tagging idea follows below. This implies that the tests live on the user side and rely on a given set of released SDK or Runner artifacts for execution.
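> One possible shape for that tagging in the Go SDK (the {{sdkVersion}} variable and {{requireAtLeast}} helper are made up for illustration): a benchmark skips itself when the SDK under test predates the feature it exercises, so the same suite runs against both releases without manual exclusion lists.
> {code:go}
> // Sketch only: a self-excluding benchmark, in a _test.go file.
> package loadtests
>
> import "testing"
>
> // sdkVersion would be derived from the released artifacts under test;
> // hard-coded here for illustration.
> var sdkVersion = "2.28.0"
>
> // requireAtLeast skips the benchmark when the SDK under test is too old.
> func requireAtLeast(b *testing.B, min string) {
>     // A real implementation would compare semantic versions, not strings.
>     if sdkVersion < min {
>         b.Skipf("requires SDK >= %s, have %s", min, sdkVersion)
>     }
> }
>
> func BenchmarkNewFeature(b *testing.B) {
>     requireAtLeast(b, "2.29.0")
>     for i := 0; i < b.N; i++ {
>         // exercise the feature introduced in the newer release here
>     }
> }
> {code}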
> Ideally the release manager can run a good chunk of these tests on their local machine, or on a host in the cloud. Any such cloud resources should be identical for the before and after runs. E.g. if one is comparing Flink performance, then the same machine types should be used to compare Beam version X and X+1.
> As inspiration, a Go tool called Benchstat does what I'm talking about for the Go Benchmark format. See the descriptions in the documentation here: [https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme] 
> It takes the results from one or more runs of a given benchmark (measuring time per operation, memory throughput, allocations per operation, etc.) on the old system, and the same from the new system, and produces averages and deltas in a readable tabular format, e.g.:
> {{$ benchstat old.txt new.txt}}
>  {{name        old time/op  new time/op  delta}}
>  {{GobEncode   13.6ms ± 1%  11.8ms ± 1%  -13.31%  (p=0.016 n=4+5)}}
>  {{JSONEncode  32.1ms ± 1%  31.8ms ± 1%  ~        (p=0.286 n=4+5)}}
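> For the Go SDK in particular, those input files can come straight out of the standard {{testing}} package: running e.g. {{go test -bench=WordCount -count=5 > old.txt}} against the old release and the same command against the new release produces exactly the input benchstat expects. A minimal sketch, with a stand-in workload rather than a real Beam pipeline:
> {code:go}
> // Sketch only: lives in a _test.go file; a real test would build and run a
> // Beam pipeline instead of the stand-in countWords workload.
> package wordcount
>
> import (
>     "strings"
>     "testing"
> )
>
> func countWords(s string) map[string]int {
>     counts := make(map[string]int)
>     for _, w := range strings.Fields(s) {
>         counts[w]++
>     }
>     return counts
> }
>
> // BenchmarkWordCount reports time per operation; -count=N repeats the whole
> // benchmark N times, giving benchstat enough samples to estimate variance.
> func BenchmarkWordCount(b *testing.B) {
>     input := strings.Repeat("to be or not to be ", 1000)
>     b.ResetTimer()
>     for i := 0; i < b.N; i++ {
>         countWords(input)
>     }
> }
> {code}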
> This would be a valuable way to produce and present results for users, and to more easily validate what performance characteristics have changed between versions.
> Given the size, breadth, and distributed nature of Beam and its associated infrastructure, this is something we likely only want to do as part of the release. It will likely be time consuming and, for larger-scale load tests on cloud resources, expensive. In order to make meaningful comparisons, as much as possible needs to be invariant between the releases under comparison.
> In particular: if running on a distributed set of resources (e.g. a cloud cluster), the machine type and count should remain invariant (Spark and Flink clusters should be the same size; Dataflow being different is trickier, but it should be unrestricted, as that's the point). Local tests on a single machine are comparable among themselves as well.
> The published results should include the specifics of the machine(s) the tests ran on: CPU, clock speed, RAM, number of machines if distributed, and the official cloud designation if using cloud provider VMs (i.e. machine types such as e2-standard-4, n2-highcpu-32, c6g.4xlarge, or D8d v4).
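> As a sketch of what that published metadata could look like (field names are hypothetical, not an existing Beam API):
> {code:go}
> // Sketch only: metadata recorded alongside each benchmark run so that later
> // readers can tell whether two results are actually comparable.
> package benchmeta
>
> type MachineSpec struct {
>     CPUModel    string // e.g. "Intel(R) Xeon(R) CPU @ 2.20GHz"
>     Cores       int
>     RAMGB       int
>     NumMachines int    // 1 for local runs
>     MachineType string // official cloud designation, e.g. "e2-standard-4"; empty for local hardware
> }
> {code}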
> The overall goal is to be able to run the comparisons on a local machine, and to be able to send jobs to clusters in the cloud. Actual provisioning of cloud resources is a non-goal of this proposal.
> Given a set of tests, we should be able to generate a text file with the results, for collation similar to what Go's benchstat does. Bonus points if we can have benchstat handle that task for us without modification.
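> As a rough sketch (the package and names are made up), a small helper in the test tooling could emit any measured runtime in the Go benchmark text format, so that an unmodified benchstat can do the collation:
> {code:go}
> // Sketch only: write arbitrary measurements in the Go benchmark text format.
> package benchtext
>
> import (
>     "fmt"
>     "io"
>     "time"
> )
>
> // Result is a hypothetical record of one run of one Beam benchmark.
> type Result struct {
>     Name    string        // e.g. "GroupByKeyLoadTest_Flink"
>     Runtime time.Duration // wall time for the run
> }
>
> // Write emits one "Benchmark<Name> 1 <ns> ns/op" line per run. Repeating the
> // same Name across several runs gives benchstat samples to compute variance.
> func Write(w io.Writer, results []Result) error {
>     for _, r := range results {
>         if _, err := fmt.Fprintf(w, "Benchmark%s 1 %d ns/op\n", r.Name, r.Runtime.Nanoseconds()); err != nil {
>             return err
>         }
>     }
>     return nil
> }
> {code}
> Two such files, one generated per release, would then be exactly the old.txt / new.txt inputs shown above.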
> Similar to our release validation scripts, a release manager (or any user) should be able to access and compare results.
> e.g. {{./release_comparison.sh ${OLD_VERSION} ${NEW_VERSION}}}
> It must be able to support Release Candidate versions.
> Adding this kind of infrastructure will improve trust in Beam and in Beam releases, and allow others to compare performance results more consistently.
> This Jira stands as a proposal and, if accepted, a place for discussion and for hanging subtasks and specifics.
> A side task that could be useful: generate these text-file versions of the benchmarks by querying the metrics database. The comparisons could then use a few datapoints from around one time point and around another, which at least makes the release manager's job a little easier, though it doesn't compare two releases directly.
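> A very rough sketch of that side task, assuming the metrics are reachable over SQL and using made-up table and column names:
> {code:go}
> // Sketch only: pull the datapoints around a timestamp from a metrics store
> // and emit them in the Go benchmark format for benchstat. Schema is hypothetical.
> package metricsdump
>
> import (
>     "database/sql"
>     "fmt"
>     "io"
>     "time"
> )
>
> // DumpWindow writes every run of the named test recorded within +/- window of
> // t as a "Benchmark<test> 1 <ns> ns/op" line.
> func DumpWindow(db *sql.DB, out io.Writer, test string, t time.Time, window time.Duration) error {
>     rows, err := db.Query(
>         `SELECT runtime_ms FROM load_test_metrics
>          WHERE test_name = ? AND ts BETWEEN ? AND ?`,
>         test, t.Add(-window), t.Add(window))
>     if err != nil {
>         return err
>     }
>     defer rows.Close()
>     for rows.Next() {
>         var runtimeMs int64
>         if err := rows.Scan(&runtimeMs); err != nil {
>             return err
>         }
>         fmt.Fprintf(out, "Benchmark%s 1 %d ns/op\n", test, runtimeMs*int64(time.Millisecond))
>     }
>     return rows.Err()
> }
> {code}
> Emitting the same line format as the live-run path above keeps both kinds of files interchangeable from benchstat's point of view.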



--
This message was sent by Atlassian Jira
(v8.3.4#803005)