
Posted to issues@spark.apache.org by "Xinrong Meng (Jira)" <ji...@apache.org> on 2022/04/30 00:18:00 UTC

[jira] [Created] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark

Xinrong Meng created SPARK-39076:
------------------------------------

Summary: Standardize Statistical Functions of pandas API on Spark
Key: SPARK-39076
URL: https://issues.apache.org/jira/browse/SPARK-39076
Project: Spark
Issue Type: Umbrella
Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng

Statistical functions are among the most commonly used functions in Data Engineering and Data Analysis.

Spark and pandas provide statistical functions in the contexts of SQL and Data Science, respectively.

pandas API on Spark implements the pandas API on top of Apache Spark. Although certain functions, for example median, may differ semantically because of the high cost of big data calculations, we should still try to reach parity at the API level.

However, critical parameters of statistical functions, such as `skipna`, are missing from the basic objects: DataFrame, Series, and Index.
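
To illustrate what a parameter such as `skipna` controls, here is a minimal pure-Python sketch of the pandas convention for a mean: with `skipna=True` missing values (NaN) are ignored, and with `skipna=False` a single NaN makes the result NaN. The helper name `mean_with_skipna` is hypothetical and is not part of pandas or pandas API on Spark; it only mirrors the documented pandas behavior on a plain list of floats.

```python
import math
from typing import Iterable


def mean_with_skipna(values: Iterable[float], skipna: bool = True) -> float:
    """Hypothetical helper mirroring the pandas `skipna` convention for a mean."""
    data = list(values)
    if skipna:
        # Drop missing values (NaN) before aggregating.
        data = [v for v in data if not math.isnan(v)]
    elif any(math.isnan(v) for v in data):
        # With skipna=False, a single NaN propagates to the result.
        return math.nan
    if not data:
        # Nothing left to aggregate, so the mean is undefined.
        return math.nan
    return sum(data) / len(data)


values = [1.0, math.nan, 3.0]
skipped = mean_with_skipna(values)                   # NaN is ignored
propagated = mean_with_skipna(values, skipna=False)  # NaN propagates
```

Without the parameter, a caller cannot choose between these two behaviors, which is the kind of API-level gap this umbrella is meant to close.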

There is an even larger gap between the statistical functions of pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. In addition, test coverage is far from complete.

Standardizing the statistical functions will increase pandas API coverage, since the missing parameters will be implemented. That should further improve user adoption.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
