Posted to issues-all@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2021/02/25 14:58:00 UTC

[jira] [Commented] (IMPALA-9470) Use Parquet bloom filters

    [ https://issues.apache.org/jira/browse/IMPALA-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290967#comment-17290967 ] 

Csaba Ringhofer commented on IMPALA-9470:
-----------------------------------------

About pushing down runtime filters:
Several things make this hard:

a. Currently Parquet uses only xxhash in its Bloom filters, which is not among the hashes we use for runtime filters (these come from Kudu: https://github.com/apache/impala/blob/aeeff53e884a67ee7f5980654a1d394c6e3e34ac/be/src/kudu/util/hash.proto#L23 ).
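To illustrate why the hash mismatch matters, here is a minimal sketch of a Bloom filter with a pluggable hash. The hash names (md5/sha1 from hashlib) are stand-ins, since Python's standard library has no xxhash or FastHash; a filter built with one hash gives meaningless answers when probed with another, which is exactly why a runtime filter hashed one way cannot be checked against a Parquet filter hashed another way.

```python
import hashlib

def bloom_bits(value: bytes, hash_name: str, m: int = 1024, k: int = 4):
    """Derive k bit positions from a named hash (md5/sha1 here are
    stand-ins for xxhash vs. Kudu's hashes)."""
    digest = hashlib.new(hash_name, value).digest()
    # Split the digest into k independent 4-byte slices.
    return {int.from_bytes(digest[4 * i:4 * i + 4], "little") % m
            for i in range(k)}

def build_filter(values, hash_name):
    bits = set()
    for v in values:
        bits |= bloom_bits(v, hash_name)
    return bits

def might_contain(filter_bits, value, hash_name):
    # No false negatives, possible false positives - but only if the
    # same hash is used for insert and probe.
    return bloom_bits(value, hash_name) <= filter_bits

f = build_filter([b"alice", b"bob"], "md5")
assert might_contain(f, b"alice", "md5")  # same hash: always finds it
# Probing f with "sha1" instead would land on unrelated bit positions,
# so its answer carries no information about membership.
```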

b. Even if we had the same hash function, the types can differ: several mappings are possible between Parquet and Impala column types, and we don't know the Parquet type until we read the file's metadata.

c. We could only skip the whole row group if the intersection of the two Bloom filters is all zeros, which seems very unlikely to me with high NDVs. It would make more sense to me to keep the full set of values in the runtime filter if the NDV is small, and check these one by one in the Parquet file's Bloom filter. Note that if the NDV in the file is small, then the column is likely to have a dictionary, which makes the Bloom filter redundant.
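The per-value probing idea in (c) can be sketched as follows. This is a hypothetical helper, not Impala code: `parquet_might_contain` stands in for probing a column chunk's Bloom filter (which may return false positives but never false negatives), and the row group is skippable only if every value from the small runtime-filter set misses.

```python
def can_skip_row_group(runtime_values, parquet_might_contain):
    """Return True if no value from the (small) runtime-filter value
    set can possibly occur in the row group, i.e. every probe of the
    file's Bloom filter misses."""
    return not any(parquet_might_contain(v) for v in runtime_values)

# Mock the file's Bloom filter with an exact set (a real Bloom filter
# would additionally admit some false positives).
file_values = {2, 4, 6}
probe = lambda v: v in file_values

assert can_skip_row_group({1, 3, 5}, probe)   # all probes miss -> skip
assert not can_skip_row_group({1, 4}, probe)  # 4 may match -> must read
```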

I see more potential in the opposite direction: reading the Parquet Bloom filter and distributing it as a runtime filter. If issues a. and b. are solved somehow, this could allow creating runtime filters earlier.
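A rough sketch of that reverse direction, under the assumption that issues a. and b. are solved (same hash function and compatible types), and additionally that the per-row-group filters share the same size: the filters read from file footers can simply be ORed together into one bitmap and shipped as a runtime filter before any rows are scanned. The function below is illustrative only, modeling each m-bit filter as a Python int.

```python
def combine_parquet_filters(filters):
    """OR together same-sized Bloom filter bitmaps (each modeled as an
    int). ORing preserves the no-false-negative property, so the
    combined filter is safe to distribute as a runtime filter."""
    combined = 0
    for f in filters:
        combined |= f
    return combined

# Bits set in any input filter stay set in the combined one.
assert combine_parquet_filters([0b0101, 0b0011]) == 0b0111
```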

> Use Parquet bloom filters
> -------------------------
>
>                 Key: IMPALA-9470
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9470
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Daniel Becker
>            Priority: Major
>              Labels: parquet
>
> PARQUET-41 has been closed recently. This means Parquet-MR is capable of writing and reading bloom filters.
> Currently bloom filters are per column chunk entries, i.e. with their help we can filter out entire row groups.
> We already filter row groups in HdfsParquetScanner::NextRowGroup() based on column chunk statistics and dictionaries. Skipping row groups based on bloom filters could also be added to this function.
> Impala could also write bloom filters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
