You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/07 06:10:18 UTC
[GitHub] [spark] zhengruifeng opened a new pull request, #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
zhengruifeng opened a new pull request, #38148:
URL: https://github.com/apache/spark/pull/38148
### What changes were proposed in this pull request?
add a dedicated expression for `product`:
1. for integral inputs, directly use `LongType` to avoid the rounding error:
2. when ignoreNA is true, short circuit when meet a `zero`;
3. when ignoreNA is false, short circuit when meet a `zero` or `null`;
### Why are the changes needed?
1. existing computation logic is too complex in the PySpark side, with a dedicated expression, we can simplify the PySpark side and apply it in more cases.
2. existing computation of `product` is likely to introduce rounding error for integral inputs, for example `55108 x 55108 x 55108 x 55108` in the following case:
before:
```
In [14]: df = pd.DataFrame({"a": [55108, 55108, 55108, 55108], "b": [55108.0, 55108.0, 55108.0, 55108.0], "c": [1, 2, 3, 4]})
In [15]: df.a.prod()
Out[15]: 9222710978872688896
In [16]: type(df.a.prod())
Out[16]: numpy.int64
In [17]: df.b.prod()
Out[17]: 9.222710978872689e+18
In [18]: type(df.b.prod())
Out[18]: numpy.float64
In [19]:
In [19]: psdf = ps.from_pandas(df)
In [20]: psdf.a.prod()
Out[20]: 9222710978872658944
In [21]: type(psdf.a.prod())
Out[21]: int
In [22]: psdf.b.prod()
Out[22]: 9.222710978872659e+18
In [23]: type(psdf.b.prod())
Out[23]: float
In [24]: df.a.prod() - psdf.a.prod()
Out[24]: 29952
```
after:
```
In [1]: import pyspark.pandas as ps
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"a": [55108, 55108, 55108, 55108], "b": [55108.0, 55108.0, 55108.0, 55108.0], "c": [1, 2, 3, 4]})
In [4]: df.a.prod()
Out[4]: 9222710978872688896
In [5]: psdf = ps.from_pandas(df)
In [6]: psdf.a.prod()
Out[6]: 9222710978872688896
In [7]: df.a.prod() - psdf.a.prod()
Out[7]: 0
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing UT
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
URL: https://github.com/apache/spark/pull/38148
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38148:
URL: https://github.com/apache/spark/pull/38148#discussion_r989860278
##########
python/pyspark/pandas/tests/test_generic_functions.py:
##########
@@ -200,6 +200,22 @@ def test_stat_functions(self):
self.assert_eq(pdf.b.kurtosis(), psdf.b.kurtosis())
self.assert_eq(pdf.c.kurtosis(), psdf.c.kurtosis())
+ def test_prod_precision(self):
Review Comment:
this test will fail in original implementation due to precision loss
the new implementation will provide the precise result for integral inputs, when the product in [Long.MinValue, Long.MaxValue]
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38148:
URL: https://github.com/apache/spark/pull/38148#issuecomment-1274017676
thanks @srowen @HyukjinKwon for reivews
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38148:
URL: https://github.com/apache/spark/pull/38148#issuecomment-1274010511
Merged to master.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38148:
URL: https://github.com/apache/spark/pull/38148#issuecomment-1272478562
also cc @HyukjinKwon @itholic @Yikun
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org