Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/12/23 00:49:00 UTC

[jira] [Updated] (SPARK-37711) Create a plan for top & frequency for pandas-on-Spark optimization

     [ https://issues.apache.org/jira/browse/SPARK-37711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-37711:
---------------------------------
    Summary: Create a plan for top & frequency for pandas-on-Spark optimization  (was: Create plans for top & frequency for pandas-on-Spark optimization)

> Create a plan for top & frequency for pandas-on-Spark optimization
> ------------------------------------------------------------------
>
>                 Key: SPARK-37711
>                 URL: https://issues.apache.org/jira/browse/SPARK-37711
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> When invoking DataFrame.describe() in the pandas API on Spark, a separate Spark job is run for each column to retrieve the `top` and `freq` statistics.
> `top` is the most common value in a column, and `freq` is the count of that value.
> We should write a utility on the Scala side that returns each value and its count in a single Spark job, e.g. via Dataset.mapPartitions, summing the per-partition counts for each key. Such an API is missing in PySpark, so it would have to be written on the Scala side.
> See [https://github.com/apache/spark/pull/34931#discussion_r772220260] for more details.
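> As a rough illustration of the single-job approach, a minimal Scala sketch could look like the following. The object name `DescribeUtils` and method `topAndFreq` are hypothetical, not an existing Spark API, and the RDD-based aggregation is only one possible shape for the utility:
> {code:scala}
> import scala.collection.mutable
> import org.apache.spark.sql.DataFrame
>
> // Hypothetical utility; not part of Spark. Assumes column values have
> // sensible equals/hashCode so they can be used as aggregation keys.
> object DescribeUtils {
>   /** Returns columnName -> (most common value, its count) in one Spark job. */
>   def topAndFreq(df: DataFrame): Map[String, (Any, Long)] = {
>     val columns = df.columns
>     df.rdd
>       .mapPartitions { rows =>
>         // Per-partition partial counts keyed by (column index, value).
>         val counts = mutable.HashMap.empty[(Int, Any), Long]
>         rows.foreach { row =>
>           var i = 0
>           while (i < columns.length) {
>             if (!row.isNullAt(i)) {
>               val k = (i, row.get(i))
>               counts(k) = counts.getOrElse(k, 0L) + 1L
>             }
>             i += 1
>           }
>         }
>         counts.iterator
>       }
>       .reduceByKey(_ + _)                                // global count per (column, value)
>       .map { case ((i, value), count) => (i, (value, count)) }
>       .reduceByKey((a, b) => if (a._2 >= b._2) a else b) // keep the max-count value per column
>       .collect()
>       .map { case (i, (value, count)) => columns(i) -> (value, count) }
>       .toMap
>   }
> }
> {code}
> This computes the partial counts with a single pass over each partition and merges them in one shuffle, instead of launching one job per column.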



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org