Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:34:21 UTC

[jira] [Resolved] (SPARK-15877) DataSource executed twice when using ORDER BY

     [ https://issues.apache.org/jira/browse/SPARK-15877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-15877.
----------------------------------
    Resolution: Incomplete

> DataSource executed twice when using ORDER BY
> ---------------------------------------------
>
>                 Key: SPARK-15877
>                 URL: https://issues.apache.org/jira/browse/SPARK-15877
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Matthew Livesey
>            Priority: Major
>              Labels: bulk-closed
>
> When executing a query against a custom DataSource, I observed that if an ORDER BY is used, the underlying DataSource is executed twice. A small example demonstrating this is here: 
> https://github.com/mattinbits/spark-sql-sort-double-execution
> From debugging, I found that when there is a "Sort" in the logical plan (and therefore in the resulting physical plan), an "Exchange" node is inserted into the plan by "EnsureRequirements.ensureDistributionAndOrdering()". This Exchange has a RangePartitioner. At the point that the DataFrame is converted to an RDD (which is before computation of that RDD should occur), the RangePartitioner causes execution of the RDD in order to sample it for statistics that guide partitioning. This is done by "Exchange.prepareShuffleDependency()", which in turn calls "SamplingUtils.reservoirSampleAndCount()". 
> In some cases this causes a significant performance degradation, wherever the cost of computing the RDD is high. The RDD is executed twice: once during the conversion of the DataFrame to an RDD, and again when the RDD is eventually computed, e.g. by a call to "collect()". There doesn't appear to be any configuration setting to control whether an RDD should be executed for sampling, and I haven't been able to determine whether the sampling is necessary for correct results or is just aimed at improving performance. 
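
The mechanism described above can be reproduced outside Spark with a toy model: a lazy dataset whose source is expensive, plus a range partitioner that must execute the dataset once to sample split points before the real sort runs. This is an illustrative Python sketch only, not Spark code; `LazyDataset`, `range_partition_bounds`, and `sort_job` are hypothetical names standing in for RDD, RangePartitioner, and the sort stage:

```python
import random

class LazyDataset:
    """Mimics a lazy RDD: compute() re-runs the (expensive) source each time."""
    def __init__(self, source_fn):
        self.source_fn = source_fn
        self.compute_count = 0   # how many times the source was actually executed
        self._cache = None

    def compute(self):
        if self._cache is not None:
            return self._cache   # cached: no re-execution of the source
        self.compute_count += 1
        return list(self.source_fn())

    def cache(self):
        # Materialize once; later compute() calls reuse the result.
        self.compute_count += 1
        self._cache = list(self.source_fn())
        return self

def range_partition_bounds(dataset, num_partitions):
    # Like a range partitioner: sample the data to choose split points.
    # This executes the dataset once, *before* the real job runs.
    data = dataset.compute()
    sample = sorted(random.sample(data, min(4, len(data))))
    step = max(1, len(sample) // num_partitions)
    return sample[::step][1:num_partitions]

def sort_job(dataset, num_partitions=2):
    bounds = range_partition_bounds(dataset, num_partitions)  # pass 1: sampling
    return sorted(dataset.compute())                          # pass 2: the sort itself

ds = LazyDataset(lambda: (x * x for x in range(10)))
sort_job(ds)
print(ds.compute_count)   # source ran twice: once for sampling, once for the sort

ds_cached = LazyDataset(lambda: (x * x for x in range(10))).cache()
sort_job(ds_cached)
print(ds_cached.compute_count)   # source ran only once; sampling hit the cache
```

In Spark itself, the analogous mitigation is to persist/cache the DataFrame (or its underlying RDD) before the sort, so the sampling pass does not force a second full computation of an expensive source, at the cost of holding the data in memory or on disk.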



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org