You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2021/07/20 16:28:00 UTC

[jira] [Commented] (SPARK-35837) Recommendations for Common Query Problems

    [ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384375#comment-17384375 ] 

Sean R. Owen commented on SPARK-35837:
--------------------------------------

What would this look like in Spark though? how do you surface recommendations?

> Recommendations for Common Query Problems
> -----------------------------------------
>
>                 Key: SPARK-35837
>                 URL: https://issues.apache.org/jira/browse/SPARK-35837
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].
> We can implement a similar feature.
>  1. Detect the most skew values for join. The user decides whether these are needed.
>  2. Detect the most skew values for window function. The user decides whether these are needed.
>  3. Detect the bucket read, for example, Analyzer add a cast to bucket column.
>  4. Recommend the user add a filter condition to the partition column of the partition table.
>  5. Check the condition of join, for example, the result of cast string to double may be incorrect.
> For example:
> {code:sql}
> 0: jdbc:hive2://localhost:10000/default> EXPLAIN RECOMMENDATION
> 0: jdbc:hive2://localhost:10000/default> SELECT a.*,
> 0: jdbc:hive2://localhost:10000/default>        CASE
> 0: jdbc:hive2://localhost:10000/default>          WHEN ( NOT ( a.exclude = 1
> 0: jdbc:hive2://localhost:10000/default>                       AND a.cobrand = 6
> 0: jdbc:hive2://localhost:10000/default>                       AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
> 0: jdbc:hive2://localhost:10000/default>               AND ( a.valid_page_count = 1 ) THEN 1
> 0: jdbc:hive2://localhost:10000/default>          ELSE 0
> 0: jdbc:hive2://localhost:10000/default>        END AS is_singlepage,
> 0: jdbc:hive2://localhost:10000/default>        ca.bsns_vrtcl_name
> 0: jdbc:hive2://localhost:10000/default> FROM   (SELECT *
> 0: jdbc:hive2://localhost:10000/default>         FROM   (SELECT *,
> 0: jdbc:hive2://localhost:10000/default>                        'VI' AS page_type
> 0: jdbc:hive2://localhost:10000/default>                 FROM   tbl1
> 0: jdbc:hive2://localhost:10000/default>                 UNION
> 0: jdbc:hive2://localhost:10000/default>                 SELECT *,
> 0: jdbc:hive2://localhost:10000/default>                        'SRP' AS page_type
> 0: jdbc:hive2://localhost:10000/default>                 FROM   tbl2
> 0: jdbc:hive2://localhost:10000/default>                 UNION
> 0: jdbc:hive2://localhost:10000/default>                 SELECT *,
> 0: jdbc:hive2://localhost:10000/default>                        'My Garage' AS page_type
> 0: jdbc:hive2://localhost:10000/default>                 FROM   tbl3
> 0: jdbc:hive2://localhost:10000/default>                 UNION
> 0: jdbc:hive2://localhost:10000/default>                 SELECT *,
> 0: jdbc:hive2://localhost:10000/default>                        'Motors Homepage' AS page_type
> 0: jdbc:hive2://localhost:10000/default>                 FROM   tbl4) t
> 0: jdbc:hive2://localhost:10000/default>         WHERE  session_start_dt BETWEEN ( '2020-01-01' ) AND (
> 0: jdbc:hive2://localhost:10000/default>                                         CURRENT_DATE() - 10 )) a
> 0: jdbc:hive2://localhost:10000/default>        LEFT JOIN (SELECT item_id,
> 0: jdbc:hive2://localhost:10000/default>                          item_site_id,
> 0: jdbc:hive2://localhost:10000/default>                          auct_end_dt,
> 0: jdbc:hive2://localhost:10000/default>                          leaf_categ_id
> 0: jdbc:hive2://localhost:10000/default>                   FROM   tbl5
> 0: jdbc:hive2://localhost:10000/default>                   WHERE  auct_end_dt >= ( '2020-01-01' )) itm
> 0: jdbc:hive2://localhost:10000/default>               ON a.item_id = itm.item_id
> 0: jdbc:hive2://localhost:10000/default>        LEFT JOIN tbl6 ca
> 0: jdbc:hive2://localhost:10000/default>               ON itm.leaf_categ_id = ca.leaf_categ_id
> 0: jdbc:hive2://localhost:10000/default>                  AND itm.item_site_id = ca.site_id;
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> |                                                                 result                                                                  |
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> | 1. Detect the most skew values for join                                                                                                 |
> |   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
> |     table: tbl5                                                                                                                         |
> |       leaf_categ_id, item_site_id, count                                                                                                |
> |       171243, 0, 115412614                                                                                                              |
> |       176984, 3, 81003252                                                                                                               |
> |       176985, 3, 75035585                                                                                                               |
> |       179680, 77, 71831618                                                                                                              |
> |       171243, 3, 69821716                                                                                                               |
> |   Check join: Join LeftOuter, (item_id#3244 = cast(item_id#1439 as decimal(20,0)))                                                      |
> |     table: tbl1                                                                                                                         |
> |       item_id, count                                                                                                                    |
> |       null, 254738836                                                                                                                   |
> |       232644544775, 35530                                                                                                               |
> |       372028681865, 24189                                                                                                               |
> |       124014973844, 21725                                                                                                               |
> |       303476035378, 17075                                                                                                               |
> |                                                                                                                                         |
> | 2. Detect the most skew values for window function                                                                                      |
> |                                                                                                                                         |
> | 3. Detect the bucket read                                                                                                               |
> |   Check join: SortMergeJoin [item_id#3244], [cast(item_id#1439 as decimal(20,0))], LeftOuter                                            |
> |     The data type do not match for bucket read, consider cast `item_id` to decimal(18,0).                                               |
> |                                                                                                                                         |
> | 4. Add a filter condition to partition column                                                                                           |
> |                                                                                                                                         |
> | 5. Check the dangerous join condition                                                                                                                 |
> |                                                                                                                                         |
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> 27 rows selected (152.685 seconds)
> {code}
> The optimized query is:
> {code:sql}
> SELECT a.*,
>      CASE
>        WHEN ( NOT ( a.exclude = 1
>                     AND a.cobrand = 6
>                     AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
>             AND ( a.valid_page_count = 1 ) THEN 1
>        ELSE 0
>      END AS is_singlepage,
>      ca.bsns_vrtcl_name
> FROM   (SELECT *
>       FROM   (SELECT *,
>                      'VI' AS page_type
>               FROM   tbl1 where item_id is not null -- Checked with user, the null values is not needed. Add a filter condition.
>               UNION
>               SELECT *,
>                      'SRP' AS page_type
>               FROM   tbl2
>               UNION
>               SELECT *,
>                      'My Garage' AS page_type
>               FROM   tbl3
>               UNION
>               SELECT *,
>                      'Motors Homepage' AS page_type
>               FROM   tbl4) t
>       WHERE  session_start_dt BETWEEN ( '2020-01-01' ) AND (
>                                       CURRENT_DATE() - 10 )) a
>      LEFT JOIN (SELECT item_id,
>                        item_site_id,
>                        auct_end_dt,
>                        leaf_categ_id
>                 FROM   tbl5
>                 WHERE  auct_end_dt >= ( '2020-01-01' )) itm
>             ON cast(a.item_id as decimal(18, 0)) = cast(itm.item_id as decimal(18, 0)) -- Checked with user. cast to decimal(18, 0) is safe.
>      LEFT JOIN tbl6 ca
>             ON itm.leaf_categ_id = ca.leaf_categ_id
>                AND itm.item_site_id = ca.site_id 
> {code}
>  
> With these optimizations, the query execution time is reduced from 42 minutes to 6.3 minutes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org