You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/14 04:12:45 UTC
[GitHub] [spark] itholic commented on a diff in pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

itholic commented on code in PR #37836:
URL: https://github.com/apache/spark/pull/37836#discussion_r970207129


##########
python/pyspark/pandas/window.py:
##########
@@ -561,6 +573,101 @@ def mean(self) -> FrameLike:
         """
         return super().mean()
 
+    def quantile(self, quantile: float, accuracy: int = 10000) -> FrameLike:
+        """
+        Calculate the rolling quantile of the values.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        quantile : float
+            Value between 0 and 1 providing the quantile to compute.
+        accuracy : int, optional
+            Default accuracy of approximation. Larger value means better accuracy.
+            The relative error can be deduced by 1.0 / accuracy.
+            This is a panda-on-Spark specific parameter.
+
+        Returns
+        -------
+        Series or DataFrame
+            Returned object type is determined by the caller of the rolling
+            calculation.
+
+        Notes
+        -----
+        `quantile` in pandas-on-Spark are using distributed percentile approximation
+        algorithm unlike pandas, the result might different with pandas, also `interpolation`
+        parameters are not supported yet.

Review Comment:
   nit: "parameters are not supported yet" -> "parameter is not supported yet" ?
   
   Also can we comment it as "TODO" above function definition ?
   
   e.g.
   
   ```python
       # TODO: support `interpolation` parameter.
       def quantile(self, quantile: float, accuracy: int = 10000) -> FrameLike:
           ...
   ```



##########
python/pyspark/pandas/window.py:
##########
@@ -561,6 +573,101 @@ def mean(self) -> FrameLike:
         """
         return super().mean()
 
+    def quantile(self, quantile: float, accuracy: int = 10000) -> FrameLike:
+        """
+        Calculate the rolling quantile of the values.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        quantile : float
+            Value between 0 and 1 providing the quantile to compute.
+        accuracy : int, optional
+            Default accuracy of approximation. Larger value means better accuracy.
+            The relative error can be deduced by 1.0 / accuracy.
+            This is a panda-on-Spark specific parameter.
+
+        Returns
+        -------
+        Series or DataFrame
+            Returned object type is determined by the caller of the rolling
+            calculation.
+
+        Notes
+        -----
+        `quantile` in pandas-on-Spark are using distributed percentile approximation
+        algorithm unlike pandas, the result might different with pandas, also `interpolation`
+        parameters are not supported yet.
+
+        the current implementation of this API uses Spark's Window without
+        specifying partition specification. This leads to move all data into
+        single partition in single machine and could cause serious
+        performance degradation. Avoid this method against very large dataset.
+
+        See Also
+        --------
+        Series.rolling : Calling object with Series data.
+        DataFrame.rolling : Calling object with DataFrames.
+        Series.quantile : Equivalent method for Series.
+        DataFrame.quantile : Equivalent method for DataFrame.

Review Comment:
   Follow the description from pandas' ?
   
   <img width="548" alt="Screen Shot 2022-09-14 at 1 08 58 PM" src="https://user-images.githubusercontent.com/44108233/190057481-0c0c239c-674d-4189-8625-41af2664ae2e.png">
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org