You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2021/07/20 16:28:00 UTC
[jira] [Commented] (SPARK-35837) Recommendations for Common Query
Problems
[ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384375#comment-17384375 ]
Sean R. Owen commented on SPARK-35837:
--------------------------------------
What would this look like in Spark though? how do you surface recommendations?
> Recommendations for Common Query Problems
> -----------------------------------------
>
> Key: SPARK-35837
> URL: https://issues.apache.org/jira/browse/SPARK-35837
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Yuming Wang
> Priority: Major
>
> Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].
> We can implement a similar feature.
> 1. Detect the most skew values for join. The user decides whether these are needed.
> 2. Detect the most skew values for window function. The user decides whether these are needed.
> 3. Detect the bucket read, for example, Analyzer add a cast to bucket column.
> 4. Recommend the user add a filter condition to the partition column of the partition table.
> 5. Check the condition of join, for example, the result of cast string to double may be incorrect.
> For example:
> {code:sql}
> 0: jdbc:hive2://localhost:10000/default> EXPLAIN RECOMMENDATION
> 0: jdbc:hive2://localhost:10000/default> SELECT a.*,
> 0: jdbc:hive2://localhost:10000/default> CASE
> 0: jdbc:hive2://localhost:10000/default> WHEN ( NOT ( a.exclude = 1
> 0: jdbc:hive2://localhost:10000/default> AND a.cobrand = 6
> 0: jdbc:hive2://localhost:10000/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
> 0: jdbc:hive2://localhost:10000/default> AND ( a.valid_page_count = 1 ) THEN 1
> 0: jdbc:hive2://localhost:10000/default> ELSE 0
> 0: jdbc:hive2://localhost:10000/default> END AS is_singlepage,
> 0: jdbc:hive2://localhost:10000/default> ca.bsns_vrtcl_name
> 0: jdbc:hive2://localhost:10000/default> FROM (SELECT *
> 0: jdbc:hive2://localhost:10000/default> FROM (SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'VI' AS page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl1
> 0: jdbc:hive2://localhost:10000/default> UNION
> 0: jdbc:hive2://localhost:10000/default> SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'SRP' AS page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl2
> 0: jdbc:hive2://localhost:10000/default> UNION
> 0: jdbc:hive2://localhost:10000/default> SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'My Garage' AS page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl3
> 0: jdbc:hive2://localhost:10000/default> UNION
> 0: jdbc:hive2://localhost:10000/default> SELECT *,
> 0: jdbc:hive2://localhost:10000/default> 'Motors Homepage' AS page_type
> 0: jdbc:hive2://localhost:10000/default> FROM tbl4) t
> 0: jdbc:hive2://localhost:10000/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND (
> 0: jdbc:hive2://localhost:10000/default> CURRENT_DATE() - 10 )) a
> 0: jdbc:hive2://localhost:10000/default> LEFT JOIN (SELECT item_id,
> 0: jdbc:hive2://localhost:10000/default> item_site_id,
> 0: jdbc:hive2://localhost:10000/default> auct_end_dt,
> 0: jdbc:hive2://localhost:10000/default> leaf_categ_id
> 0: jdbc:hive2://localhost:10000/default> FROM tbl5
> 0: jdbc:hive2://localhost:10000/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm
> 0: jdbc:hive2://localhost:10000/default> ON a.item_id = itm.item_id
> 0: jdbc:hive2://localhost:10000/default> LEFT JOIN tbl6 ca
> 0: jdbc:hive2://localhost:10000/default> ON itm.leaf_categ_id = ca.leaf_categ_id
> 0: jdbc:hive2://localhost:10000/default> AND itm.item_site_id = ca.site_id;
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> | result |
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> | 1. Detect the most skew values for join |
> | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) |
> | table: tbl5 |
> | leaf_categ_id, item_site_id, count |
> | 171243, 0, 115412614 |
> | 176984, 3, 81003252 |
> | 176985, 3, 75035585 |
> | 179680, 77, 71831618 |
> | 171243, 3, 69821716 |
> | Check join: Join LeftOuter, (item_id#3244 = cast(item_id#1439 as decimal(20,0))) |
> | table: tbl1 |
> | item_id, count |
> | null, 254738836 |
> | 232644544775, 35530 |
> | 372028681865, 24189 |
> | 124014973844, 21725 |
> | 303476035378, 17075 |
> | |
> | 2. Detect the most skew values for window function |
> | |
> | 3. Detect the bucket read |
> | Check join: SortMergeJoin [item_id#3244], [cast(item_id#1439 as decimal(20,0))], LeftOuter |
> | The data type do not match for bucket read, consider cast `item_id` to decimal(18,0). |
> | |
> | 4. Add a filter condition to partition column |
> | |
> | 5. Check the dangerous join condition |
> | |
> +-----------------------------------------------------------------------------------------------------------------------------------------+--+
> 27 rows selected (152.685 seconds)
> {code}
> The optimized query is:
> {code:sql}
> SELECT a.*,
> CASE
> WHEN ( NOT ( a.exclude = 1
> AND a.cobrand = 6
> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
> AND ( a.valid_page_count = 1 ) THEN 1
> ELSE 0
> END AS is_singlepage,
> ca.bsns_vrtcl_name
> FROM (SELECT *
> FROM (SELECT *,
> 'VI' AS page_type
> FROM tbl1 where item_id is not null -- Checked with user, the null values is not needed. Add a filter condition.
> UNION
> SELECT *,
> 'SRP' AS page_type
> FROM tbl2
> UNION
> SELECT *,
> 'My Garage' AS page_type
> FROM tbl3
> UNION
> SELECT *,
> 'Motors Homepage' AS page_type
> FROM tbl4) t
> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND (
> CURRENT_DATE() - 10 )) a
> LEFT JOIN (SELECT item_id,
> item_site_id,
> auct_end_dt,
> leaf_categ_id
> FROM tbl5
> WHERE auct_end_dt >= ( '2020-01-01' )) itm
> ON cast(a.item_id as decimal(18, 0)) = cast(itm.item_id as decimal(18, 0)) -- Checked with user. cast to decimal(18, 0) is safe.
> LEFT JOIN tbl6 ca
> ON itm.leaf_categ_id = ca.leaf_categ_id
> AND itm.item_site_id = ca.site_id
> {code}
>
> With these optimizations, the query execution time is reduced from 42 minutes to 6.3 minutes.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org