You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/03/20 00:52:00 UTC

[jira] [Work logged] (HIVE-24606) Multi-stage materialized CTEs can lose intermediate data

     [ https://issues.apache.org/jira/browse/HIVE-24606?focusedWorklogId=569202&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-569202 ]

ASF GitHub Bot logged work on HIVE-24606:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Mar/21 00:51
            Start Date: 20/Mar/21 00:51
    Worklog Time Spent: 10m 
      Work Description: github-actions[bot] commented on pull request #1873:
URL: https://github.com/apache/hive/pull/1873#issuecomment-803210686


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 569202)
    Time Spent: 40m  (was: 0.5h)

> Multi-stage materialized CTEs can lose intermediate data
> --------------------------------------------------------
>
>                 Key: HIVE-24606
>                 URL: https://issues.apache.org/jira/browse/HIVE-24606
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: okumin
>            Assignee: okumin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> With complex multi-stage CTEs, Hive can start a latter stage before its previous stage finishes.
>  That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve dependency between multistage materialized CTEs when a non-materialized CTE cuts in.
>  [https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414]
>  
> For example, when submitting this query,
> {code:sql}
> SET hive.optimize.cte.materialize.threshold=2;
> SET hive.optimize.cte.materialize.full.aggregate.only=false;
> WITH x AS ( SELECT 'x' AS id ), -- not materialized
> a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root
> a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root
> SELECT * FROM a1
> UNION ALL
> SELECT * FROM x
> UNION ALL
> SELECT * FROM a2
> UNION ALL
> SELECT * FROM a2;
> {code}
> `toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It means the dependency between `a1` and `a2` will be ignored and `a2` can start without waiting for `a1`. As a result, the above query returns the following result.
> {code:java}
> +-----+
> | id  |
> +-----+
> | a1  |
> | x   |
> +-----+
> {code}
> For your information, I ran this test with revision = 425e1ff7c054f87c4db87e77d004282d529599ae.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)