Posted to issues@spark.apache.org by "Kevin Ma (Jira)" <ji...@apache.org> on 2020/03/13 07:03:00 UTC

[jira] [Resolved] (SPARK-31143) Spark 2.4.4 count distinct query much slower than Spark 1.6.2 and Hive 1.2.1

     [ https://issues.apache.org/jira/browse/SPARK-31143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Ma resolved SPARK-31143.
------------------------------
    Resolution: Not A Problem

> Spark 2.4.4 count distinct query much slower than Spark 1.6.2 and Hive 1.2.1
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-31143
>                 URL: https://issues.apache.org/jira/browse/SPARK-31143
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: Spark 2.4.4 with a self-built Spark thrift server
> Hadoop 2.6.0-cdh5.7.4
> Hive 1.2.1
> Spark 1.6.2 bundled with CDH-5.7.4
>            Reporter: Kevin Ma
>            Priority: Major
>
> In our company, we are migrating our ad-hoc query engine from Hive to Spark. We use the Spark thrift server, version 2.4.4. Most of our queries run well and are faster than on Hive, but one complex query with multiple count(distinct) expressions runs extremely slowly on Spark compared to Hive, and is much slower even than on Spark 1.6.2. The query is as follows:
> {code:java}
> select
>     'All' as industry_name
>     ,sum(shop_cnt)/31 as shop_cnt
>     ,sum(sku_cnt)/31 as sku_cnt
>     ,sum(limit_act_sku_cnt)/31 as limit_act_sku_cnt
>     ,sum(discount_act_sku_cnt)/31 as discount_act_sku_cnt
>     ,sum(sku_mid_price)/31 as sku_mid_price
>     ,sum(sku_high_mid_price)/31 as sku_high_mid_price
> from
> (
>     select
>         cal_dt
>         ,approx_count_distinct(shop_id) as shop_cnt
>         ,sum(sku_cnt)/approx_count_distinct(shop_id) as sku_cnt
>         ,sum(limit_act_sku_cnt)/approx_count_distinct(shop_id) as limit_act_sku_cnt
>         ,sum(discount_act_sku_cnt)/approx_count_distinct(shop_id) as discount_act_sku_cnt
>         ,percentile(cast(sku_mid_price as bigint), 0.5) as sku_mid_price
>         ,percentile(cast(sku_high_mid_price as bigint), 0.75) as sku_high_mid_price
>     from
>     (
>         select
>             cal_dt
>             ,vender_id
>             ,shop_id
>             ,approx_count_distinct(sku_id) as sku_cnt
>             ,approx_count_distinct(case when is_limit_grab_act_sku=1 then sku_id end) as limit_act_sku_cnt
>             ,approx_count_distinct(case when is_offer_act_sku=1 then sku_id end) as discount_act_sku_cnt
>             ,percentile(cast(sku_price as bigint), 0.5) as sku_mid_price
>             ,percentile(cast(sku_price as bigint), 0.75) as sku_high_mid_price
>         from bi_dw.dw_dj_prd_shop_sku_info
>         where cal_dt = '2019-12-01'
>         group by cal_dt, vender_id, shop_id
>     ) a
>     group by cal_dt
> ) a;
> {code}
> The query took about 18 minutes on Spark 2.4.4, but only about 80 seconds on Hive 1.2.1. On Spark 1.6.2 it took about 2 to 3 minutes (run in the Spark shell, so there is no precise timing output).
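> To see which physical aggregate operator Spark 2.4.4 picks for this pattern, one option is to EXPLAIN a reduced version of the innermost aggregation (a minimal sketch only; the table and column names come from the query above):
> {code:sql}
> -- Diagnostic sketch: the physical plan shows whether this mix of
> -- approx_count_distinct and percentile is planned with HashAggregate,
> -- ObjectHashAggregate, or falls back to the slower SortAggregate.
> explain
> select
>     approx_count_distinct(sku_id) as sku_cnt
>     ,percentile(cast(sku_price as bigint), 0.5) as sku_mid_price
> from bi_dw.dw_dj_prd_shop_sku_info
> where cal_dt = '2019-12-01'
> group by cal_dt, vender_id, shop_id;
> {code}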
> While investigating this, I found SPARK-9241 (https://issues.apache.org/jira/browse/SPARK-9241), which optimizes queries with multiple count distinct aggregations. But when I looked at the Spark 2.4.4 source code, the related code is no longer there.
> So my question is: why was that code removed?
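> A quick way to check whether an equivalent rewrite still applies would be to EXPLAIN a query with two real count(distinct) columns; if a multi-distinct rewrite is active, I would expect an Expand node in the physical plan (again only a sketch, reusing the table above):
> {code:sql}
> -- Sketch: two distinct aggregations over different columns. The optimizer
> -- is expected to rewrite this with an Expand operator that duplicates each
> -- input row once per distinct group before aggregating.
> explain
> select
>     count(distinct shop_id)
>     ,count(distinct sku_id)
> from bi_dw.dw_dj_prd_shop_sku_info
> where cal_dt = '2019-12-01';
> {code}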



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org