Posted to issues@impala.apache.org by "Gabor Kaszab (Jira)" <ji...@apache.org> on 2023/12/18 14:44:00 UTC

[jira] [Created] (IMPALA-12649) Use max(data_sequence_number) for joining equality delete rows

Gabor Kaszab created IMPALA-12649:
-------------------------------------

             Summary: Use max(data_sequence_number) for joining equality delete rows
                 Key: IMPALA-12649
                 URL: https://issues.apache.org/jira/browse/IMPALA-12649
             Project: IMPALA
          Issue Type: Sub-task
          Components: Frontend
            Reporter: Gabor Kaszab


Improvement idea for the future:

If Flink always writes equality delete files and frequently reuses the same primary key, the hash map will contain the same key with multiple data sequence numbers. During probing, each hash table lookup then has to loop over all of these sequence numbers and check them. In fact, only the largest data sequence number matters: lower sequence numbers for the same primary key add no information.

So we could add an Aggregation node to the right side of the join, like "PK1, PK2, ..., max(data_sequence_number), group by PK1, PK2, ...".
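The effect of that aggregation can be sketched in plain Python (the names below are illustrative, not actual Impala code); it also shows why only the maximum sequence number per key is needed on the build side:

```python
# Collapse equality delete records so the build-side hash table stores only
# the highest data_sequence_number per primary key. Under Iceberg semantics,
# a data row is deleted iff an equality delete with a strictly higher
# sequence number exists for its key.

def build_delete_map(delete_records):
    """delete_records: iterable of (primary_key, data_sequence_number).

    Equivalent to: SELECT pk, max(data_sequence_number) ... GROUP BY pk.
    """
    delete_map = {}
    for pk, seq in delete_records:
        # Keep only the maximum; lower sequence numbers add no information.
        if seq > delete_map.get(pk, -1):
            delete_map[pk] = seq
    return delete_map

def is_deleted(delete_map, pk, row_seq):
    """Probe side: one lookup, one comparison, no loop over duplicates."""
    return delete_map.get(pk, -1) > row_seq
```

With the aggregation in place, each probe is a single lookup and comparison instead of a loop over every duplicate entry for the key.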

We would then need to decide when to add this node to the plan and when not to. We should also avoid placing an EXCHANGE between the aggregation node and the JOIN node: it would be redundant, since both operators would partition on the same key expressions (the primary keys).

If we had "hash teams" in Impala, we could always add this aggregator operator, as it would be in the same "hash team" with the JOIN operator, i.e. we wouldn't need to build the hash table twice. Microsoft's paper about hash joins and hash teams: [https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=fc1c78cbef5062cf49fdb309b1935af08b759d2d]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)