
Posted to issues-all@impala.apache.org by "Shant Hovsepian (Jira)" <ji...@apache.org> on 2020/11/15 22:56:00 UTC

[jira] [Created] (IMPALA-10328) Track Keys During Query Planning

Shant Hovsepian created IMPALA-10328:
----------------------------------------

             Summary: Track Keys During Query Planning
                 Key: IMPALA-10328
                 URL: https://issues.apache.org/jira/browse/IMPALA-10328
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
            Reporter: Shant Hovsepian


A key for a tuple is a set of attributes whose values uniquely identify the tuple. Any set of attributes that uniquely identifies all tuples in a relation is a _super key_. A minimal _super key_, one from which no attribute can be removed without losing uniqueness, is usually called a candidate key or, when it is chosen as the main identifier, a primary key.

Keys can be used to track functional dependencies and provide a way to implement many more optimizations within the Impala planner.

Key information can be explicitly defined or inferred. Explicit keys are available from Kudu, where we know which columns make up a table's primary key, or from the constraint support in [IMPALA-3531|https://issues.apache.org/jira/browse/IMPALA-3531]. Keys can also be inferred after certain operations, for example (the list below is not meant to be exhaustive):
* All slots from a {{SELECT DISTINCT}} query operation
* The grouping expression slots from the result of a {{GROUP BY}} 
* Distinct version of {{UNION/EXCEPT/INTERSECT}}

Armstrong's axioms and a variation of the transitive closure used for value transfers in the Analyzer can be used to calculate any new or derived keys through the query plan.
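
As a rough illustration of the closure idea, the sketch below computes the set of slots functionally determined by a starting set, which is the standard attribute-closure construction that follows from Armstrong's axioms. Everything here is hypothetical and not taken from the Impala code base: {{KeyClosure}}, {{FuncDep}}, and the use of plain strings for slots are assumptions made only to keep the example self-contained.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Impala code base. */
class KeyClosure {
  /** A functional dependency lhs -> rhs over slot names (assumed representation). */
  static final class FuncDep {
    final Set<String> lhs;
    final Set<String> rhs;

    FuncDep(Set<String> lhs, Set<String> rhs) {
      this.lhs = lhs;
      this.rhs = rhs;
    }
  }

  /** Returns every slot that is functionally determined by 'slots' under 'fds'. */
  static Set<String> closure(Set<String> slots, List<FuncDep> fds) {
    // Reflexivity: a set of slots always determines itself.
    Set<String> result = new HashSet<>(slots);
    boolean changed = true;
    // Iterate to a fixed point, as in a transitive closure.
    while (changed) {
      changed = false;
      for (FuncDep fd : fds) {
        // Transitivity: once the whole lhs is determined, so is the rhs.
        if (result.containsAll(fd.lhs) && result.addAll(fd.rhs)) {
          changed = true;
        }
      }
    }
    return result;
  }

  /** 'slots' is a super key if its closure covers every slot of the relation. */
  static boolean isSuperKey(Set<String> slots, Set<String> allSlots, List<FuncDep> fds) {
    return closure(slots, fds).containsAll(allSlots);
  }
}
```

With the dependencies {{id -> name, dept}} and {{dept -> dept_name}}, the closure of {{id}} covers all four slots, so {{id}} is a super key, while {{dept}} alone is not.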

Key tracking would need to support slots from a single tuple and slots across multiple tuples in the case of joins or unions, and would need a means to verify whether the result expressions of a {{QueryStmt}} form a key.
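
For the join case, one possible rule is sketched below. It assumes an inner equi-join, keys whose slots are never NULL, and slots modelled as plain strings; {{JoinKeys}} and its methods are hypothetical names, not existing planner classes. The reasoning is that if the join slots on one side cover a key of that side, each row of the other side matches at most one row, so the other side's keys remain keys of the join output. Otherwise, the union of a key from each side is still a key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Impala code base. */
class JoinKeys {
  /** True if 'slots' contains every slot of at least one key in 'keys'. */
  static boolean coversSomeKey(Set<String> slots, List<Set<String>> keys) {
    for (Set<String> key : keys) {
      if (slots.containsAll(key)) return true;
    }
    return false;
  }

  /** Derives keys for the output of an inner equi-join (assumed semantics). */
  static List<Set<String>> innerJoinKeys(List<Set<String>> leftKeys,
      List<Set<String>> rightKeys, Set<String> leftJoinSlots, Set<String> rightJoinSlots) {
    List<Set<String>> result = new ArrayList<>();
    // Each left row matches at most one right row, so left keys survive.
    if (coversSomeKey(rightJoinSlots, rightKeys)) result.addAll(leftKeys);
    // Symmetric case: each right row matches at most one left row.
    if (coversSomeKey(leftJoinSlots, leftKeys)) result.addAll(rightKeys);
    if (result.isEmpty()) {
      // Fallback: a key from each side, taken together, identifies a joined row.
      for (Set<String> lk : leftKeys) {
        for (Set<String> rk : rightKeys) {
          Set<String> combined = new HashSet<>(lk);
          combined.addAll(rk);
          result.add(combined);
        }
      }
    }
    return result;
  }
}
```

For example, joining {{orders}} (key {{o_id}}) to {{customers}} (key {{c_id}}) on {{o_cust = c_id}} leaves {{o_id}} as a key of the output, since each order matches at most one customer.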

This functionality would help provide the following possible optimizations:
* Remove redundant {{DISTINCT}} aggregation
* Minimize the number of slots that need to be hashed for grouping, shuffle, and certain join operations
* Reduce redundant shuffles in distributed planning
* More reliable cardinality estimates
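
Two of the optimizations above can be sketched as simple set checks once the keys of an operator's input are known. This is only a minimal illustration under the assumption that keys are tracked as sets of slot names; {{KeyOptimizations}} and its methods are hypothetical and do not correspond to existing planner code.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Impala code base. */
class KeyOptimizations {
  /**
   * A DISTINCT over 'outputSlots' is redundant if those slots already contain
   * a key of the input, because the input rows are then unique on them.
   */
  static boolean isDistinctRedundant(Set<String> outputSlots, List<Set<String>> inputKeys) {
    for (Set<String> key : inputKeys) {
      if (outputSlots.containsAll(key)) return true;
    }
    return false;
  }

  /**
   * If the grouping slots contain a key, the remaining grouping slots are
   * functionally determined by it, so only the key needs to be hashed.
   * Returns the smallest such key, or the original grouping slots if none applies.
   */
  static Set<String> minimalHashSlots(Set<String> groupingSlots, List<Set<String>> inputKeys) {
    Set<String> best = groupingSlots;
    for (Set<String> key : inputKeys) {
      if (groupingSlots.containsAll(key) && key.size() < best.size()) {
        best = key;
      }
    }
    return best;
  }
}
```

With a single input key {{id}}, a {{SELECT DISTINCT id, name}} would be flagged as redundant, and grouping by {{id, name, dept}} would only need to hash {{id}}.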






