You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/07 06:10:18 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs

zhengruifeng opened a new pull request, #38148:
URL: https://github.com/apache/spark/pull/38148

   ### What changes were proposed in this pull request?
   add a dedicated expression for `product`:
   
   1. for integral inputs, directly use `LongType` to avoid the rounding error:
   2. when ignoreNA is true, short circuit when meet a `zero`;
   3. when ignoreNA is false, short circuit when meet a `zero` or `null`;
   
   
   ### Why are the changes needed?
   
   1. existing computation logic is too complex in the PySpark side, with a dedicated expression, we can simplify the PySpark side and apply it in more cases.
   2. existing computation of `product` is likely to introduce rounding error for integral inputs, for example `55108 x 55108 x 55108 x 55108` in the following case:
   
   before:
   ```
   In [14]: df = pd.DataFrame({"a": [55108, 55108, 55108, 55108], "b": [55108.0, 55108.0, 55108.0, 55108.0], "c": [1, 2, 3, 4]})
   
   In [15]: df.a.prod()
   Out[15]: 9222710978872688896
   
   In [16]: type(df.a.prod())
   Out[16]: numpy.int64
   
   In [17]: df.b.prod()
   Out[17]: 9.222710978872689e+18
   
   In [18]: type(df.b.prod())
   Out[18]: numpy.float64
   
   In [19]: 
   
   In [19]: psdf = ps.from_pandas(df)
   
   In [20]: psdf.a.prod()
   Out[20]: 9222710978872658944
   
   In [21]: type(psdf.a.prod())
   Out[21]: int
   
   In [22]: psdf.b.prod()
   Out[22]: 9.222710978872659e+18
   
   In [23]: type(psdf.b.prod())
   Out[23]: float
   
   
   In [24]: df.a.prod() - psdf.a.prod()
   Out[24]: 29952
   ```
   
   after:
   ```
   In [1]: import pyspark.pandas as ps
   
   In [2]: import pandas as pd
   
   In [3]: df = pd.DataFrame({"a": [55108, 55108, 55108, 55108], "b": [55108.0, 55108.0, 55108.0, 55108.0], "c": [1, 2, 3, 4]})
   
   In [4]: df.a.prod()
   Out[4]: 9222710978872688896
   
   In [5]: psdf = ps.from_pandas(df)
   
   In [6]: psdf.a.prod()
   Out[6]: 9222710978872688896                                                     
   
   In [7]: df.a.prod() - psdf.a.prod()
   Out[7]: 0
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   existing UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs
URL: https://github.com/apache/spark/pull/38148


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38148:
URL: https://github.com/apache/spark/pull/38148#discussion_r989860278


##########
python/pyspark/pandas/tests/test_generic_functions.py:
##########
@@ -200,6 +200,22 @@ def test_stat_functions(self):
         self.assert_eq(pdf.b.kurtosis(), psdf.b.kurtosis())
         self.assert_eq(pdf.c.kurtosis(), psdf.c.kurtosis())
 
+    def test_prod_precision(self):

Review Comment:
   this test will fail in original implementation due to precision loss
   
   the new implementation will provide the precise result for integral inputs, when the product in [Long.MinValue, Long.MaxValue]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38148:
URL: https://github.com/apache/spark/pull/38148#issuecomment-1274017676

   thanks @srowen @HyukjinKwon for reivews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38148:
URL: https://github.com/apache/spark/pull/38148#issuecomment-1274010511

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #38148: [SPARK-40698][PS][SQL] Improve the precision of `product` for integral inputs

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38148:
URL: https://github.com/apache/spark/pull/38148#issuecomment-1272478562

   also cc @HyukjinKwon @itholic @Yikun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org