Posted to issues@spark.apache.org by "S Daniel Zafar (Jira)" <ji...@apache.org> on 2020/03/13 16:03:00 UTC
[jira] [Resolved] (SPARK-31137) Opportunity to simplify execution plan when passing empty dataframes to subtract()
[ https://issues.apache.org/jira/browse/SPARK-31137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
S Daniel Zafar resolved SPARK-31137.
------------------------------------
Resolution: Won't Do
Moving to Databricks internal board.
> Opportunity to simplify execution plan when passing empty dataframes to subtract()
> ----------------------------------------------------------------------------------
>
> Key: SPARK-31137
> URL: https://issues.apache.org/jira/browse/SPARK-31137
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 2.4.5
> Reporter: S Daniel Zafar
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Execution plans are the same whether an empty or a non-empty DataFrame is passed to PySpark's subtract() call.
> {code:java}
> df.subtract(regDf){code}
> yields the same physical plan as:
> {code:java}
> df.subtract(emptyDf){code}
> Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both DataFrames, simplifying the plan for the empty case could yield significant speed-ups: if the incoming DataFrame is empty, there is nothing to subtract and no comparison against it should be needed.
>
> Should be a quick fix for a seasoned committer.
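
The semantics behind the request can be modeled in plain Python, without Spark. This is only an illustrative sketch under stated assumptions: `distinct` and `except_distinct` are hypothetical helper names, not Spark APIs, rows are modeled as tuples in a list, and Spark's optimizer works on logical plans rather than on in-memory data. The sketch shows that with an empty right side the comparison step can be skipped, although de-duplicating the left side is still required by EXCEPT DISTINCT.

```python
from typing import Hashable, Iterable, List, Sequence


def distinct(rows: Iterable[Hashable]) -> List[Hashable]:
    """Drop duplicate rows, keeping first-seen order (for readability only;
    Spark itself gives no ordering guarantee here)."""
    return list(dict.fromkeys(rows))


def except_distinct(left: Sequence[Hashable], right: Sequence[Hashable]) -> List[Hashable]:
    """Hypothetical model of EXCEPT DISTINCT: distinct rows of `left`
    that do not appear in `right`."""
    # Short-circuit for the case described in the ticket: with an empty
    # right side there is nothing to compare against, so only the
    # de-duplication of the left side remains.
    if len(right) == 0:
        return distinct(left)
    # General path: build a lookup of the right side, then filter the left.
    right_rows = set(right)
    return [row for row in distinct(left) if row not in right_rows]


# Stand-ins for df, regDf and emptyDf from the ticket (assumed sample data).
df_rows = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
reg_rows = [(2, "b")]
empty_rows = []

print(except_distinct(df_rows, reg_rows))    # [(1, 'a'), (3, 'c')]
print(except_distinct(df_rows, empty_rows))  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

In this model the empty case never touches the right side beyond checking its length, which is the kind of simplification the ticket asks the planner to make.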