Posted to issues@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2023/02/17 09:31:00 UTC

[jira] [Created] (IMPALA-11928) Try to delay runtime filter generation till NDV is known

Csaba Ringhofer created IMPALA-11928:
----------------------------------------

             Summary: Try to delay runtime filter generation till NDV is known
                 Key: IMPALA-11928
                 URL: https://issues.apache.org/jira/browse/IMPALA-11928
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer


Currently, runtime filters are initialized before the build-side hash table starts being built, and they are populated in parallel with the hash table:
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/partitioned-hash-join-builder-ir.cc#L66
This means that Impala has to rely on planning-time estimates when sizing the bloom filter to reach the desired false positive probability (FPP).

If the build side fits in memory, it is possible to build the hash table first and then create the runtime filter by iterating through the keys in the hash table. At that point the NDV of the keys can be computed, and the bloom filters can be given optimal sizes.
Agreeing on the correct size is more complex for shuffled joins, because different builders may see different key NDVs, so the builders need to synchronize before they start building the bloom filters.
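
As a rough illustration of the sizing step, here is a minimal Python sketch. It assumes the textbook formula for a classic bloom filter, m = -n * ln(p) / (ln 2)^2, which is not necessarily the formula Impala's own filter implementation uses. The function names, the power-of-two rounding, and the idea of summing the per-builder NDVs as an upper bound on the global NDV are all assumptions made for this example, not part of the proposal.

```python
import math


def optimal_bloom_bits(ndv: int, fpp: float) -> int:
    """Bits needed by a classic bloom filter for `ndv` keys at target `fpp`.

    Textbook formula (assumed here, not taken from Impala):
    m = -n * ln(p) / (ln 2)^2.
    """
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    if ndv <= 0:
        return 0
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def agreed_filter_bits(per_builder_ndv: list[int], fpp: float) -> int:
    """Hypothetical synchronization step for a shuffled join.

    Each builder reports the NDV it observed. The sum is an upper bound on
    the global NDV, so every builder can size its filter from that sum.
    """
    total_ndv = sum(per_builder_ndv)
    return round_up_pow2(optimal_bloom_bits(total_ndv, fpp))
```

For example, `agreed_filter_bits([400, 600], 0.01)` sizes a filter for 1000 keys at a 1% FPP and rounds the result up to the next power of two.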

If the hash table becomes too large and the builders start to spill, it is probably better to fall back to building the bloom filter in parallel with the hash table than to reread the spilled partitions from disk once all the data has arrived. It is possible, though, that by this point the NDV is already too large for the filter to be useful, in which case it is better to disable the filter.
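
The fallback described above could be sketched as a small decision function. This is only an illustration: the enum, the function name, and the `max_useful_ndv` threshold are hypothetical and do not correspond to existing Impala code or options.

```python
from enum import Enum


class FilterStrategy(Enum):
    BUILD_FROM_HASH_TABLE = "build_from_hash_table"
    BUILD_IN_PARALLEL = "build_in_parallel"
    DISABLE = "disable"


def choose_strategy(spilled: bool, ndv_so_far: int, max_useful_ndv: int) -> FilterStrategy:
    """Pick how to produce the runtime filter (hypothetical policy).

    spilled:        whether the builder has started spilling partitions
    ndv_so_far:     distinct keys observed up to this point
    max_useful_ndv: assumed cutoff above which the filter is not worth building
    """
    if not spilled:
        # Everything fits in memory: wait for the hash table, then size the
        # filter from the exact NDV.
        return FilterStrategy.BUILD_FROM_HASH_TABLE
    if ndv_so_far > max_useful_ndv:
        # Already too many distinct keys when spilling starts.
        return FilterStrategy.DISABLE
    # Avoid rereading spilled partitions: build alongside the hash table.
    return FilterStrategy.BUILD_IN_PARALLEL
```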



--
This message was sent by Atlassian Jira
(v8.20.10#820010)