Posted to common-issues@hadoop.apache.org by "Shuyan Zhang (Jira)" <ji...@apache.org> on 2022/09/01 05:15:00 UTC

[jira] [Updated] (HADOOP-18426) Improve the accuracy of MutableStat mean

     [ https://issues.apache.org/jira/browse/HADOOP-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuyan Zhang updated HADOOP-18426:
----------------------------------
    Description: 
The current MutableStat mean calculation is prone to losing accuracy because the sum of samples can grow too large.
Storing large integers in the double type loses precision. For example, 9223372036854775707 and 9223372036854775708 are both stored as the same double value, approximately 9223372036854776000. Therefore, we should avoid computing the average from a cumulative sum and instead update the average each time a sample is added. In short, processing each sample on its own improves the accuracy of the mean.

  was:The current MutableStat mean calculation is prone to losing accuracy because the sum of samples can grow too large. We can process each sample on its own to improve mean accuracy.


> Improve the accuracy of MutableStat mean
> ----------------------------------------
>
>                 Key: HADOOP-18426
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18426
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Shuyan Zhang
>            Assignee: Shuyan Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> The current MutableStat mean calculation is prone to losing accuracy because the sum of samples can grow too large.
> Storing large integers in the double type loses precision. For example, 9223372036854775707 and 9223372036854775708 are both stored as the same double value, approximately 9223372036854776000. Therefore, we should avoid computing the average from a cumulative sum and instead update the average each time a sample is added. In short, processing each sample on its own improves the accuracy of the mean.
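
The description above can be illustrated with a minimal Java sketch. It is not the MutableStat code and not necessarily the change made in the pull request; the class name RunningMeanSketch and both helper methods are hypothetical, chosen only for illustration. The sketch first shows the two long values from the description collapsing to the same double, then compares a sum-based mean with an incremental update of the form mean += (x - mean) / n. The incremental form is not more accurate for every input; the sample values here were picked so that the difference is visible.

```java
public class RunningMeanSketch {

    // Hypothetical helper: accumulate a double sum, then divide once.
    // Precision is lost as soon as the sum exceeds 2^53.
    static double meanFromSum(long[] samples) {
        double sum = 0.0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Hypothetical helper: update the mean on every sample, so the
    // stored value stays at the magnitude of the samples themselves.
    static double incrementalMean(long[] samples) {
        double mean = 0.0;
        long n = 0;
        for (long s : samples) {
            n++;
            mean += (s - mean) / n;
        }
        return mean;
    }

    public static void main(String[] args) {
        // The two values from the description map to the same double.
        double a = (double) 9223372036854775707L;
        double b = (double) 9223372036854775708L;
        System.out.println(a == b); // true

        // Five identical samples of 2^53 - 1, the largest odd integer
        // that a double represents exactly.
        long v = 9007199254740991L;
        long[] samples = {v, v, v, v, v};

        // The true mean is v. The sum-based result drifts because the
        // running sum is rounded; the incremental result stays at v.
        System.out.println(meanFromSum(samples) == (double) v);
        System.out.println(incrementalMean(samples) == (double) v);
    }
}
```

With identical samples the incremental update adds (v - v) / n = 0 after the first sample, so the mean never leaves v, whereas the running sum passes 2^54 and 2^55 and is rounded to a multiple of 4 and then 8 along the way.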


