You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Weizhong (JIRA)" <ji...@apache.org> on 2015/08/08 02:34:46 UTC

[jira] [Commented] (SPARK-9066) Improve cartesian performance

    [ https://issues.apache.org/jira/browse/SPARK-9066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662707#comment-14662707 ] 

Weizhong commented on SPARK-9066:
---------------------------------

Yes, the root reaason is same, that is cause by scan HDFS too many times, in [PR#6454|https://github.com/apache/spark/pull/6454] use coalesce to decrease partitions, but add two shuffles, but if we change the cartesian order also can decrease the scan times, which I have done in [PR#7417|https://github.com/apache/spark/pull/7417]

> Improve cartesian performance 
> ------------------------------
>
>                 Key: SPARK-9066
>                 URL: https://issues.apache.org/jira/browse/SPARK-9066
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Weizhong
>            Priority: Minor
>
> Currently, for CartesianProduct, if right plan partition record number are small than left partition record number, then the performance is bad as need do many times scan for right plan.
> For example:
> {noformat}
> with single_value as (
>   select max(1) tpcds_val from date_dim
> )
> select sum(ss_quantity * ss_sales_price) ssales, tpcds_val
> from store_sales, single_value
> group by tpcds_val
> {noformat}
> above SQL clause, right plan only have 1 record, left plan have 1823 partiton(in our test) and each partition has more than 4000 records, then for each left plan partition record we need scan data from hdfs for right plan.
> That is, for left plan we need scan _left_plan_partition_num_ times, for right plan we need scan _left_plan_partition_num * right_plan_partition_num_ times, total is  _left_plan_partition_num * (1 + right_plan_partition_num)_ times



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org