You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2021/01/11 13:46:00 UTC

[jira] [Commented] (IMPALA-10430) Allow limiting duplicate rows from joins

    [ https://issues.apache.org/jira/browse/IMPALA-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262647#comment-17262647 ] 

Zoltán Borók-Nagy commented on IMPALA-10430:
--------------------------------------------

I think this problem could be more naturally handled by marking key columns with PRIMARY KEY. We don't support PRIMARY KEYs currently (only for Kudu) because we don't want to check the constraint during insertions. However, we could use this information during joins.

E.g. for equi-joins if one side is a primary key column then we'd know that the JOIN cannot produce more rows than MAX(left child, right child).

> Allow limiting duplicate rows from joins
> ----------------------------------------
>
>                 Key: IMPALA-10430
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10430
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Minor
>
> Small mistakes in join conditions can lead to huge number of duplicates and turn a simple query into a massive resource hog until it is cancelled (or runs out of memory). An even worse case is if this happens in an INSERT and data is being dumped to HDFS until some quote is hit.
> An idea for a safety net is to track the  number of "original" rows (that come from scanning a table) and "joined" rows (that come from joins), and abort the query if joined/original is above a certain ratio. 
> E.g. the following query options could control this:
> max_joined_original_ratio
> min_rows_to_enforce_joined_original_ratio (to allow intentional cross joins for smaller datesets)
> So the query could be cancelled if 
> max(max_joined_original_ratio * original, min_rows_to_enforce_joined_original_ratio ) < joined
> We could find default values that would rarely limit sane queries, e.g.
> max_joined_original_ratio = 100
> min_rows_to_enforce_joined_original_ratio=10,000,000,000
> If someone knows that the query should have no duplicates at all then this could be enforced with
> max_joined_original_ratio = 1,
> min_rows_to_enforce_joined_original_ratio=0
> Tracking 'original' and 'joined' could be done several ways, but it is important that it should be done "globally", not per join node , because consecutive joins'  "duplicate factors" are multiplied, e.g. five joins that all  have a ratio of 10x would result in 10000x global ratio. A possibility is to use profile counters in scan and join nodes and sum them in the coordinator.
> Apart from safety from runaway queries this could also help the planner by having enforced limits, especially in the max_joined_original_ratio = 1 case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org