You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/11/15 01:24:00 UTC

[jira] [Work logged] (HIVE-26737) Subquery returning wrong results when database has materialized views

     [ https://issues.apache.org/jira/browse/HIVE-26737?focusedWorklogId=825948&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-825948 ]

ASF GitHub Bot logged work on HIVE-26737:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Nov/22 01:23
            Start Date: 15/Nov/22 01:23
    Worklog Time Spent: 10m 
      Work Description: scarlin-cloudera opened a new pull request, #3761:
URL: https://github.com/apache/hive/pull/3761

   …alized views
   
   When there is a materialized view in the materialized view registry, the HiveMaterializedViewASTSubQueryRewriteShuttle runs and rewrites some RelNodes and RexNodes.
   
   On creation time, the HivePlannerContext is given the RexSubquery nodes which are used to see if it is a correlated subquery with an agg. In the case where the RexSubQuery was rewritten, the Context contains references to stale nodes. It loses the correlated subquery information and creates an incorrect query plan.
   
   The boolean check for subqueries with agg is now done within the Calcite nodes rather than when checking the ASTNodes. It was only used at rule time, so it made more sense for the calculation of the boolean value to be done there, and it's safer in the long run (as opposed to just update the global context when a new RexSubQuery is created).
   
   The HiveFilter and HiveProject will contain the structure holding the calculated correlation information. The information is done with a lazy fetch and only calculated when needed.
   
   The HiveCorrelationInfo structure only contains information for the current subquery level, similar to the old code. A correlated variable cannot go down to a subquery within a subquery at this point.
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Specified in description above
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   Specified in description above
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Added q test which failed before change, succeeds after




Issue Time Tracking
-------------------

            Worklog Id:     (was: 825948)
    Remaining Estimate: 0h
            Time Spent: 10m

> Subquery returning wrong results when database has materialized views
> ---------------------------------------------------------------------
>
>                 Key: HIVE-26737
>                 URL: https://issues.apache.org/jira/browse/HIVE-26737
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Steve Carlin
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When HS2 has materialized views in its registry, subqueries with correlated variables may return wrong results.
> An example of this:
>  
> {code:java}
> CREATE TABLE t_test1(
>   id int,
>   int_col int,
>   year int,
>   month int 
> );
> CREATE TABLE t_test2(
>   id int,
>   int_col int,
>   year int,
>   month int 
> );
> CREATE TABLE dummy (
>   id int 
> ) stored as orc TBLPROPERTIES ('transactional'='true');
> CREATE MATERIALIZED VIEW need_a_mat_view_in_registry AS
> SELECT * FROM dummy where id > 5;
> INSERT INTO t_test1 VALUES (1, 1, 2009, 1), (10,0, 2009, 1); 
> INSERT INTO t_test2 VALUES (1, 1, 2009, 1); 
> select id, int_col, year, month from t_test1 s where s.int_col = (select count(*) from t_test2 t where s.id = t.id) order by id; 
> {code}
> The select statement should produce 2 rows, but it is only producing one.
> The CBO plan produced has an inner join instead of a left join.
> {code:java}
> HiveSortLimit(sort0=[$0], dir0=[ASC])
>   HiveProject(id=[$0], int_col=[$1], year=[$2], month=[$3])
>     HiveJoin(condition=[AND(=($0, $5), =($4, $6))], joinType=[inner], algorithm=[none], cost=[not available])
>       HiveProject(id=[$0], int_col=[$1], year=[$2], month=[$3], CAST=[CAST($1):BIGINT])
>         HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT NULL(CAST($1):BIGINT))])
>           HiveTableScan(table=[[default, t_test1]], table:alias=[s])
>       HiveProject(id=[$0], $f1=[$1])
>         HiveFilter(condition=[IS NOT NULL($1)])
>           HiveAggregate(group=[{0}], agg#0=[count()])
>             HiveFilter(condition=[IS NOT NULL($0)])
>               HiveTableScan(table=[[default, t_test2]], table:alias=[t]){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)