Posted to issues-all@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2021/01/11 16:22:00 UTC

[jira] [Comment Edited] (IMPALA-10430) Allow limiting duplicate rows from joins

    [ https://issues.apache.org/jira/browse/IMPALA-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262743#comment-17262743 ] 

Zoltán Borók-Nagy edited comment on IMPALA-10430 at 1/11/21, 4:21 PM:
----------------------------------------------------------------------

Users could use ALTER TABLE to set PK/FK constraints on existing tables if they want these checks (it seems we currently don't support adding PK/FK constraints via ALTER TABLE, but I don't see why we couldn't). I think it's a "small price to pay for salvation".
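
For illustration, a minimal sketch of declaring such informational constraints at table creation time (table and column names are made up; the ALTER TABLE form is hypothetical since, as noted, it isn't supported today):

  -- Informational only: Impala does not enforce these constraints.
  CREATE TABLE dim_customer (id BIGINT, name STRING, PRIMARY KEY (id));
  CREATE TABLE fact_sales (sale_id BIGINT, customer_id BIGINT, amount DECIMAL(10,2),
      FOREIGN KEY (customer_id) REFERENCES dim_customer (id));

  -- Hypothetical, not supported today: adding the constraint to an existing table.
  ALTER TABLE fact_sales ADD FOREIGN KEY (customer_id) REFERENCES dim_customer (id);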

Violations could also be checked if the join order is reversed, i.e. the dimension table is on the probe side and the fact table is on the build side, because each row from the fact table can only be used once, and this is easy to detect.

Although if PK/FK information is at hand, the planner could always put the fact table on the probe side.


was (Author: boroknagyz):
Users could use ALTER TABLE to set PK/FK constraints on existing tables if they want these checks. I think it's a "small price to pay for salvation".

Violations could also be checked if the join order is reversed, i.e. the dimension table is on the probe side and the fact table is on the build side, because each row from the fact table can only be used once, and this is easy to detect.

Although if PK/FK information is at hand, the planner could always put the fact table on the probe side.

> Allow limiting duplicate rows from joins
> ----------------------------------------
>
>                 Key: IMPALA-10430
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10430
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Minor
>
> Small mistakes in join conditions can lead to a huge number of duplicates and turn a simple query into a massive resource hog until it is cancelled (or runs out of memory). An even worse case is if this happens in an INSERT and data is dumped to HDFS until some quota is hit.
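> For example, a join that accidentally drops part of the join key behaves almost like a cross join (table and column names are made up for illustration):
> -- Intended join key: (store_id, day), but 'day' was forgotten,
> -- so every sales row matches every inventory row of the same store
> -- and the output can be orders of magnitude larger than either input.
> SELECT s.*, i.on_hand
> FROM sales s
> JOIN inventory i ON s.store_id = i.store_id;  -- missing: AND s.day = i.day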
> An idea for a safety net is to track the number of "original" rows (that come from scanning a table) and "joined" rows (that come from joins), and abort the query if joined/original is above a certain ratio.
> E.g. the following query options could control this:
> max_joined_original_ratio
> min_rows_to_enforce_joined_original_ratio (to allow intentional cross joins for smaller datasets)
> So the query could be cancelled if 
> max(max_joined_original_ratio * original, min_rows_to_enforce_joined_original_ratio) < joined
> We could find default values that would rarely limit sane queries, e.g.
> max_joined_original_ratio = 100
> min_rows_to_enforce_joined_original_ratio=10,000,000,000
> If someone knows that the query should have no duplicates at all, then this could be enforced with
> max_joined_original_ratio = 1,
> min_rows_to_enforce_joined_original_ratio=0
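> For illustration, assuming the options above existed (they are only proposed here, not implemented), the two modes could be set like this:
> -- Safety-net defaults: cancel once joined > max(100 * original, 10,000,000,000)
> SET max_joined_original_ratio=100;
> SET min_rows_to_enforce_joined_original_ratio=10000000000;
> -- Strict mode: the query must not multiply rows at all
> SET max_joined_original_ratio=1;
> SET min_rows_to_enforce_joined_original_ratio=0;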
> Tracking 'original' and 'joined' could be done in several ways, but it is important that it is done "globally", not per join node, because consecutive joins' "duplicate factors" multiply, e.g. five joins that each have a ratio of 10x result in a 100,000x global ratio. A possibility is to use profile counters in the scan and join nodes and sum them in the coordinator.
> Apart from protecting against runaway queries, this could also help the planner by providing enforced limits, especially in the max_joined_original_ratio = 1 case.


