Posted to issues@spark.apache.org by "Weizhong (JIRA)" <ji...@apache.org> on 2015/07/15 11:17:06 UTC

[jira] [Created] (SPARK-9066) Improve cartesian performance

Weizhong created SPARK-9066:
-------------------------------

             Summary: Improve cartesian performance 
                 Key: SPARK-9066
                 URL: https://issues.apache.org/jira/browse/SPARK-9066
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Weizhong
            Priority: Minor


Currently, for CartesianProduct, if the right plan has fewer partitions than the left plan, performance is poor because the right plan has to be scanned many times.
For example:
{noformat}
with single_value as (
  select max(1) tpcds_val from date_dim
)
select sum(ss_quantity * ss_sales_price) ssales, tpcds_val
from store_sales, single_value
group by tpcds_val
{noformat}
In the SQL statement above, the right plan has only 1 partition while the left plan has 1823 partitions (in our test), so for each left plan partition we need to scan the right plan's data from HDFS.

That is, the left plan is scanned _left_plan_partition_num_ times and the right plan is scanned _left_plan_partition_num * right_plan_partition_num_ times; the total is _left_plan_partition_num * left_plan_partition_num * right_plan_partition_num_ times.
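
As a rough illustration of the per-side counts described above, here is a minimal sketch in plain Python (not Spark code). The function name scan_counts and its parameters are hypothetical; it only evaluates the two per-side formulas stated in this issue for the partition numbers from our test.

```python
def scan_counts(left_partitions, right_partitions):
    """Evaluate the per-side scan counts as described in this issue.

    Hypothetical helper for illustration only; it models the counts
    stated in the text and does not call Spark.
    """
    # The left plan is scanned once per left partition.
    left_scans = left_partitions
    # The right plan is scanned left_partitions * right_partitions times.
    right_scans = left_partitions * right_partitions
    return left_scans, right_scans


# Partition numbers from the test above: 1823 on the left, 1 on the right.
left_scans, right_scans = scan_counts(1823, 1)
print(left_scans, right_scans)
```

Under this model the single-partition right plan is read from HDFS once for every one of the 1823 left partitions.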


