You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/11/26 08:18:31 UTC

[GitHub] [incubator-doris] xinyiZzz opened a new pull request #4959: [Bug] Fix the loss of precision when Decimal calculates variance/stddev

xinyiZzz opened a new pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959


   ## Proposed changes
   
   By cast Decimal to Double, realize the variance calculation of Decimal type column.
   In my local Spark2.4 and Mysql, the variance of Decimal is also calculated by cast to Double.
   NOTE: There may be two parts of precision loss:
       - Decimal cast to Double
       - In the variance calculation process, the precision of the Double type intermediate variable is lost (the effective number of digits of Double is 16)
   
   ## Types of changes
   
   What types of changes does your code introduce to Doris?
   _Put an `x` in the boxes that apply_
   
   - [x] Bugfix (non-breaking change which fixes an issue)
   
   ## Checklist
   
   - [x] I have create an issue on (Fix #3127), and have described the bug/feature there in detail
   - [x] Compiling and unit tests pass locally with my changes
   
   ## Further comments
   
   Before fix the variance calculation by Decimal cast to Double, I implemented a method of directly calculating the variance on Decimal, using DecimalV2Value as the type of intermediate variable. see Commit: https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066
   
   Compared with cast to Double, the considerations and advantages of calculating variance directly based on Decimal:
       - Less loss of precision.
       - Faster performance.
       - Consistent with other functions, such as `avg()`, `sum()`, etc., these functions use Decimal to calculate directly in Doris, and use DecimalV2Value as the type of intermediate variable.
   
   However, there is an overflow problem in calculating variance based on Decimal. When `Variance> 9223372036854775807999999999/N` (9223372036854775807999999999 is the maximum value of DecimalV2Value, N is the number of data rows), the intermediate variable in the calculation process will overflow and return a maximum value.
   
   I feel that theoretical business requirements will not calculate the variance greater than `9223372036854775807999999999/N`, so cast to Double means lower accuracy and worse performance, but both Spark and Mysql use cast to double methods, so I Pull this code.
   
   # precision comparison:
       - When the first 16 bits of the data are the same, such as `12345678901234560` and `12345678901234561`, at this time, the result calculated by cast to double is 0 (loss of precision), and the correct result by direct calculation by Decimal is 0.5
       - When the variance result is large, compared to the result after cast to double `42981942961.912369`, the result directly calculated by Decimal `42981942961.913988699` is more accurate
   
   # Performance comparison:
   The test results of variance/stddev and other functions are as follows:
       - Cast to Double calculation: 0.054s
       - Decimal direct calculation: 0.034s
   
   Table building method:
   ```
   CREATE TABLE stddev_samp_test29
   (event_day int, money DECIMAL(27,9) DEFAULT "0")
   DISTRIBUTED BY HASH(event_day) BUCKETS 5
   PROPERTIES("replication_num" = "1");
   ```
   
   Data with 10w rows of equal length, such as `99.00066`, is generated by:
   ```
   f = open('data.txt','w')
   n = 100000
   while (n):
    f.write('0,' + "%.5f"% (99 + random.random()) +'\n')
    n -= 1
   ```
   
   It’s too important to complete the preliminary research, so sad...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman commented on pull request #4959: [Bug] Fix the loss of precision when Decimal calculates variance/stddev

Posted by GitBox <gi...@apache.org>.
morningman commented on pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959#issuecomment-734641674


   I think this PR it too trick.
   https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066
   This seems better to me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] xinyiZzz edited a comment on pull request #4959: [Bug] Fix the loss of precision when Decimal calculates variance/stddev

Posted by GitBox <gi...@apache.org>.
xinyiZzz edited a comment on pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959#issuecomment-735405409


   > I think this PR it too trick.
   > [xinyiZzz@0288a6f](https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066)
   > This seems better to me.
   
   I am not sure if it is inconsistent with Spark, whether it will cause problems later.
   Maybe I can do it in an elegant way, doing cast to double in Be's function, this will be similar to the second method.
   Also this will change more places. The first trick above is the smallest change I found.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman merged pull request #4959: [Bug] Fix the loss of precision when Decimal calculates variance/stddev

Posted by GitBox <gi...@apache.org>.
morningman merged pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] xinyiZzz commented on pull request #4959: [Bug] Fix the loss of precision when Decimal calculates variance/stddev

Posted by GitBox <gi...@apache.org>.
xinyiZzz commented on pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959#issuecomment-735405409


   > I think this PR it too trick.
   > [xinyiZzz@0288a6f](https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066)
   > This seems better to me.
   
   Maybe I can do it in an elegant way, doing cast to double in Be's function.
   This will change more places. The trick above is the easiest way I can find.
   Because I am not sure if it is inconsistent with Spark, whether it will cause problems later.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] xinyiZzz edited a comment on pull request #4959: [Bug] Fix the loss of precision when Decimal calculates variance/stddev

Posted by GitBox <gi...@apache.org>.
xinyiZzz edited a comment on pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959#issuecomment-735405409


   > I think this PR it too trick.
   > [xinyiZzz@0288a6f](https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066)
   > This seems better to me.
   
   Maybe I can do it in an elegant way, doing cast to double in Be's function, this will be similar to the second method.
   Also this will change more places. The first trick above is the smallest change I found.
   But I am not sure if it is inconsistent with Spark, whether it will cause problems later.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] xinyiZzz edited a comment on pull request #4959: [Bug] Fix the loss of precision when Decimal calculates variance/stddev

Posted by GitBox <gi...@apache.org>.
xinyiZzz edited a comment on pull request #4959:
URL: https://github.com/apache/incubator-doris/pull/4959#issuecomment-735405409


   > I think this PR it too trick.
   > [xinyiZzz@0288a6f](https://github.com/xinyiZzz/incubator-doris/commit/0288a6f76245bdf1f6eb0b54d3f5921414772066)
   > This seems better to me.
   
   Maybe I can do it in an elegant way, doing cast to double in Be's function.
   This will change more places. The trick above is the easiest way I can find.
   But I am not sure if it is inconsistent with Spark, whether it will cause problems later.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org