You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2021/01/11 10:29:00 UTC

[jira] [Created] (IMPALA-10430) Allow limiting duplicate rows from joins

Csaba Ringhofer created IMPALA-10430:
----------------------------------------

             Summary: Allow limiting duplicate rows from joins
                 Key: IMPALA-10430
                 URL: https://issues.apache.org/jira/browse/IMPALA-10430
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer


Small mistakes in join conditions can lead to huge number of duplicates and turn a simple query into a massive resource hog until it is cancelled (or runs out of memory). An even worse case is if this happens in an INSERT and data is being dumped to HDFS until some quote is hit.

An idea for a safety net is to track the  number of "original" rows (that come from scanning a table) and "joined" rows (that come from joins), and abort the query if joined/original is above a certain ratio. 

E.g. the following query options could control this:
max_joined_original_ratio
min_rows_to_enforce_joined_original_ratio (to allow intentional cross joins for smaller datesets)

So the query could be cancelled if 
max(max_joined_original_ratio * original, min_rows_to_enforce_joined_original_ratio ) < joined

We could find default values that would rarely limit sane queries, e.g.
max_joined_original_ratio = 100, min_rows_to_enforce_joined_original_ratio=10,000,000,000

If someone knows that the query should have no duplicates at all then this could be enforced with
max_joined_original_ratio = 1,
min_rows_to_enforce_joined_original_ratio=0

Tracking 'original' and 'joined' could be done several ways, but it is important that it should be done "globally", not per join node , because consecutive joins'  "duplicate factors" are multiplied, e.g. five joins that all  have a ratio of 10x would result in 10000x global ratio. A possibility is to use profile counters in scan and join nodes and sum them in the coordinator.

Apart from safety from runaway queries this could also help the planner by having enforced limits, especially in the max_joined_original_ratio = 1 case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)