Posted to commits@daffodil.apache.org by "Mike Beckerle (Jira)" <ji...@apache.org> on 2021/01/07 20:39:00 UTC

[jira] [Closed] (DAFFODIL-1510) Improve performance report with variance information

     [ https://issues.apache.org/jira/browse/DAFFODIL-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Beckerle closed DAFFODIL-1510.
-----------------------------------
      Assignee: Mike Beckerle
    Resolution: Not A Problem

This is an Owl-internal thing. Not part of Daffodil.

> Improve performance report with variance information
> ----------------------------------------------------
>
>                 Key: DAFFODIL-1510
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1510
>             Project: Daffodil
>          Issue Type: Improvement
>          Components: Performance, QA
>            Reporter: Mike Beckerle
>            Assignee: Mike Beckerle
>            Priority: Major
>
> A big improvement for these reports would be to make them "self-noise-eliminating", so unlike the report attached, one could eliminate all the red-lights that are about deltas that are "in the noise".
> We want to draw attention to (i.e., red-light) deltas that represent a statistically significant drop in performance. This can be a drop relative to prior performance of this branch, or a drop relative to prior performance of a baseline release.
> To do this you need variance-based statistics like Z-score, which is based on standard deviation. Z-score means "how many standard deviations away from the mean is this value." Z-scores between -1 and 1 imply "it's ordinary variation, most likely due to noise". A Z-score outside of -1 to 1 implies "it's significant. Take a look."
> We need the mean and standard deviation of (previousVal - baselineVal). We can then compute (currentVal - baselineVal), and if its z-score is < -1.0, then we would red-light the value - it means there is a statistically significant degradation in performance (relative to the baseline) due to this commit's code changes. This would only red-light changes due to this code commit. If a test performance is relatively unchanged day to day, but always slow relative to the baseline, this would not red-light that day's delta.
> We probably also want to red-light if there is a general degradation in performance even for tests that are running faster than the baseline, so we would also want mean and standard deviation of previousVal, and similarly red-light if the delta z-score (relative to previousVal) is < -1.0.
> And we want to red-light (or pink-light) tests that are simply slower than the baseline by a statistically significant amount as an ongoing trend. So we would include the currentVal in the mean and stdDev of previousVal, and in the mean and stdDev of (previousVal - baselineVal). Like everything else here, the assumption is these values are time taken, so lower is better/faster. If the mean of (previousVal - baselineVal) is negative by more than stdDev(previousVal - baselineVal), then the trend is that this test is slower than the baseline by a significant amount on an ongoing basis, so we should "pink-light" the test results. That particular day's run might or might not have reflected a statistically significant improvement or degradation, but the trend is still below the baseline by a statistically significant amount.
> This takes all the noise variability out of the color highlighting.
> Example:
> baseline is 200, previous is 150, current 139. Mean of prev-baseline is 175, and std-dev of prev-baseline is 12.
> So, current minus that mean (139 - 175) is -36. Z-score of that is -36/12 = -3.0, which is < -1.0. So the red-light goes on.
> Example 2:
> Current is 120. Mean of previous is 142, standard deviation of previous is 12.
> Delta from mean is 120 - 142 = -22. Z-score is -22/12 = -1.83, which is < -1.0, so we red-light this because it represents a statistically significant drop in performance from the average for that test.
> Example 3:
> Current is 120, folding that into mean and std deviation of (previous - baseline) gives mean -20 stdDev of 10. That means the test is generally 20 units slower than the baseline. The z-score of -20 relative to stdDev 10 is -2.0, so we would "pink light" the test, as generally being slower than the baseline on an ongoing basis.
> The inverse of these, statistically significant improvements, could generate a green-light (or light-green).
> To compute this you need at least 12 points of history so that you can have a meaningful mean and standard deviation to compute from.
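
The rules in the description above could be sketched roughly as follows. This is only an illustration, not code from Daffodil or from the report tooling. All function names (z_from_stats, z_score, light, trend_light) are hypothetical, the use of the sample standard deviation is an assumption (the description does not say which kind), and the sign convention simply follows the rule as written: a Z-score below -1.0 is red-lighted and one above 1.0 is green-lighted.

```python
from statistics import mean, stdev

# The description asks for at least 12 points of history.
MIN_HISTORY = 12


def z_from_stats(value, mu, sigma):
    """Z-score of value given an already computed mean and standard deviation."""
    return (value - mu) / sigma


def z_score(value, history):
    """Z-score of value relative to history, or None if it cannot be computed.

    Assumption: sample standard deviation (statistics.stdev) is used.
    """
    if len(history) < MIN_HISTORY:
        return None  # too little history for a meaningful mean and std-dev
    sigma = stdev(history)
    if sigma == 0:
        return None  # no variation at all, so a Z-score is undefined
    return z_from_stats(value, mean(history), sigma)


def light(z, threshold=1.0):
    """Map a Z-score to a highlight color, following the rule in the text."""
    if z is None:
        return "none"
    if z < -threshold:
        return "red"    # statistically significant drop
    if z > threshold:
        return "green"  # statistically significant improvement
    return "none"       # within ordinary variation ("in the noise")


def trend_light(mean_delta, std_delta):
    """Pink-light when the mean of (previous - baseline) is negative by more
    than one standard deviation of that same delta, as in Example 3."""
    if std_delta > 0 and mean_delta < -std_delta:
        return "pink"
    return "none"


# Example 2 from the text: current 120, mean 142, std-dev 12.
print(light(z_from_stats(120, 142, 12)))
# Example 3 from the text: mean -20, std-dev 10.
print(trend_light(-20, 10))
```

With the numbers from Example 2 the Z-score is -22/12, which falls below -1.0, and with the numbers from Example 3 the mean of -20 is more than one standard deviation (10) below zero.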


