Posted to issues@spark.apache.org by "Xinrong Meng (Jira)" <ji...@apache.org> on 2022/05/02 20:51:00 UTC

[jira] [Updated] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark

     [ https://issues.apache.org/jira/browse/SPARK-39076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-39076:
---------------------------------
    Description: 
Statistical functions are among the most commonly used functions in data engineering and data analysis.

Spark and pandas provide statistical functions in the contexts of SQL and data science, respectively.

pandas API on Spark implements the pandas API on top of Apache Spark. Although certain functions, for example median, may have semantic differences due to the high cost of big data computation, we should still try to reach parity at the API level.
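To illustrate the median example (a minimal sketch; only the pandas side is executed here, and the pandas-on-Spark behavior described in the comments is the approximate, distributed implementation):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
# pandas computes the exact median, interpolating between the two middle values.
print(s.median())  # 2.5
# pandas-on-Spark, by contrast, computes median approximately (via Spark's
# approximate percentile machinery) to avoid a full sort of large data,
# so its result can differ slightly from pandas on the same input.
```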

However, critical parameters of statistical functions, such as `skipna`, are missing from the basic objects: DataFrame, Series, and Index.
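For reference, this is how `skipna` behaves in pandas (a minimal sketch of the parameter whose support is missing or incomplete on the pandas-on-Spark side):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
# skipna=True (the default) ignores missing values in the aggregation.
print(s.sum())              # 4.0
# skipna=False propagates NaN through the aggregation instead.
print(s.sum(skipna=False))  # nan
```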

There is an even larger gap between the statistical functions of pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. In addition, test coverage is far from complete.

With statistical functions standardized, pandas API coverage will increase as the missing parameters are implemented, which would further improve user adoption.

See details at https://docs.google.com/document/d/1IHUQkSVMPWiK8Jhe0GUtMHnDS6LB4_z9K2ktWmORSSg/edit?usp=sharing.



> Standardize Statistical Functions of pandas API on Spark
> --------------------------------------------------------
>
>                 Key: SPARK-39076
>                 URL: https://issues.apache.org/jira/browse/SPARK-39076
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org