Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/06/22 02:22:00 UTC

[jira] [Resolved] (SPARK-28128) Pandas Grouped UDFs should skip over empty partitions

     [ https://issues.apache.org/jira/browse/SPARK-28128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-28128.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24926
[https://github.com/apache/spark/pull/24926]

> Pandas Grouped UDFs should skip over empty partitions
> -----------------------------------------------------
>
>                 Key: SPARK-28128
>                 URL: https://issues.apache.org/jira/browse/SPARK-28128
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.4.3
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>             Fix For: 3.0.0
>
>
> When running FlatMapGroupsInPandasExec or AggregateInPandasExec, the shuffle uses the number of partitions set by "spark.sql.shuffle.partitions", which defaults to 200. If the data is small, e.g. in testing, many of the partitions will be empty but are still processed just like non-empty ones. For example, ArrowPythonRunner.compute is still called and starts a number of threads that do nothing because there is nothing to iterate over. Skipping these computations for empty partitions would save time overall.
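
The idea in the description can be illustrated with a small, Spark-free sketch. This is not the code from pull request 24926; the names run_partitions and double_rows are hypothetical, and plain Python lists stand in for shuffle partitions. It only shows the general pattern: peek at a partition's iterator and start the expensive runner only when at least one row is present.

```python
import itertools
from typing import Callable, Iterable, Iterator, List, Tuple

# Unique marker used to detect an exhausted iterator without
# confusing it with a legitimate row value such as None.
_EMPTY = object()


def run_partitions(
    partitions: Iterable[Iterable[int]],
    runner: Callable[[Iterator[int]], Iterator[int]],
    skip_empty: bool = True,
) -> Tuple[List[List[int]], int]:
    """Apply `runner` to each partition.

    Returns the per-partition outputs and the number of times the
    runner was actually launched. `runner` is a hypothetical stand-in
    for costly per-partition setup (for example, starting worker threads).
    """
    outputs: List[List[int]] = []
    launches = 0
    for part in partitions:
        rows = iter(part)
        if skip_empty:
            first = next(rows, _EMPTY)
            if first is _EMPTY:
                # Empty partition: emit an empty result, skip the runner.
                outputs.append([])
                continue
            # Put the peeked row back in front of the remaining rows.
            rows = itertools.chain([first], rows)
        launches += 1
        outputs.append(list(runner(rows)))
    return outputs, launches


def double_rows(rows: Iterator[int]) -> Iterator[int]:
    """Toy per-partition computation used only for illustration."""
    return (r * 2 for r in rows)


if __name__ == "__main__":
    # 200 partitions, mirroring the default shuffle partition count,
    # with only two of them holding data.
    parts: List[List[int]] = [[] for _ in range(200)]
    parts[3] = [1, 2]
    parts[150] = [5]

    _, with_skip = run_partitions(parts, double_rows, skip_empty=True)
    _, without_skip = run_partitions(parts, double_rows, skip_empty=False)
    print(f"runner launches with skipping: {with_skip}")
    print(f"runner launches without skipping: {without_skip}")
```

In this toy setup the results are identical either way; only the number of runner launches changes, which is the saving the issue is after.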



