Posted to issues@spark.apache.org by "Kevin Ma (Jira)" <ji...@apache.org> on 2020/03/13 06:29:00 UTC

[jira] [Created] (SPARK-31143) Spark 2.4.4 count distinct query much slower than Spark 1.6.2 and Hive 1.2.1

Kevin Ma created SPARK-31143:
---------------------------------

             Summary: Spark 2.4.4 count distinct query much slower than Spark 1.6.2 and Hive 1.2.1
                 Key: SPARK-31143
                 URL: https://issues.apache.org/jira/browse/SPARK-31143
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 2.4.4
         Environment: Spark 2.4.4 with a self-built Spark thrift server

Hadoop 2.6.0-cdh5.7.4

Hive 1.2.1

Spark 1.6.2 contained in CDH-5.7.4
            Reporter: Kevin Ma


In our company, we are migrating our ad-hoc query engine from Hive to Spark. We use the Spark thrift server, version 2.4.4. Most of our queries run well, and faster than on Hive, but we have a complex query with multiple count(distinct) expressions that runs extremely slowly on Spark compared to Hive. It is much slower even than on Spark 1.6.2. The query is as follows:
{code:java}
select
'All' as industry_name
,sum(shop_cnt)/31 as shop_cnt
,sum(sku_cnt)/31 as sku_cnt
,sum(limit_act_sku_cnt)/31 as limit_act_sku_cnt
,sum(discount_act_sku_cnt)/31 as discount_act_sku_cnt
,sum(sku_mid_price)/31 as sku_mid_price
,sum(sku_high_mid_price)/31 as sku_high_mid_price
FROM
(
-- middle level: roll the per-shop metrics up to one row per day
select
cal_dt
,approx_count_distinct(shop_id) as shop_cnt
,sum(sku_cnt)/approx_count_distinct(shop_id) as sku_cnt
,sum(limit_act_sku_cnt)/approx_count_distinct(shop_id) as limit_act_sku_cnt
,sum(discount_act_sku_cnt)/approx_count_distinct(shop_id) as discount_act_sku_cnt
,percentile(cast(sku_mid_price as bigint), 0.5) as sku_mid_price
,percentile(cast(sku_high_mid_price as bigint), 0.75) as sku_high_mid_price
from
(
-- innermost level: per-shop SKU metrics for the selected day
select
cal_dt
,vender_id
,shop_id
,approx_count_distinct(sku_id) as sku_cnt
,approx_count_distinct(case when is_limit_grab_act_sku=1 then sku_id end) as limit_act_sku_cnt
,approx_count_distinct(case when is_offer_act_sku=1 then sku_id end) as discount_act_sku_cnt
,percentile(cast(sku_price as bigint),0.5) as sku_mid_price
,percentile(cast(sku_price as bigint),0.75) as sku_high_mid_price
from bi_dw.dw_dj_prd_shop_sku_info
where cal_dt = '2019-12-01'
group by cal_dt, vender_id, shop_id
) a
group by cal_dt
) a;
{code}
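By the way, approx_count_distinct also takes an optional relative-error argument (the default is 0.05); a looser bound makes the underlying HyperLogLog sketch smaller and cheaper. A sketch with an explicit bound, which the query above does not use:
{code:sql}
-- illustration only: explicit relative error of 10% instead of the 0.05 default
select approx_count_distinct(shop_id, 0.1) as shop_cnt
from bi_dw.dw_dj_prd_shop_sku_info
where cal_dt = '2019-12-01';
{code}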
The query takes about 18 minutes on Spark 2.4.4, but only about 80 seconds on Hive 1.2.1. On Spark 1.6.2 it takes about 2 to 3 minutes (run in the Spark shell, so there is no accurate time-taken output).
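
To see where the 18 minutes go, comparing the physical plans between versions is probably the first step. A minimal sketch against the innermost aggregate (same table as above, trimmed to two of the distinct counts):
{code:sql}
-- compare this plan between Spark 1.6.2 and Spark 2.4.4
explain extended
select
cal_dt
,vender_id
,shop_id
,approx_count_distinct(sku_id) as sku_cnt
,approx_count_distinct(case when is_limit_grab_act_sku=1 then sku_id end) as limit_act_sku_cnt
from bi_dw.dw_dj_prd_shop_sku_info
where cal_dt = '2019-12-01'
group by cal_dt, vender_id, shop_id;
{code}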

While investigating this, I found SPARK-9241 (https://issues.apache.org/jira/browse/SPARK-9241), which optimizes count distinct. But when I looked at the Spark 2.4.4 code, I found that the related code is no longer there.
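
For context, the pattern SPARK-9241 targets, as I understand it, is several exact distinct aggregates over different columns in one query. A stripped-down example of that pattern (table t and columns g, a, b are hypothetical), handy for checking how each version plans it:
{code:sql}
-- hypothetical table t(g, a, b): two distinct aggregates in one query
select
g
,count(distinct a) as cnt_a
,count(distinct b) as cnt_b
from t
group by g;
{code}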

So my question is: why was that code removed?


