Posted to dev@parquet.apache.org by Yash Ganthe <ya...@gmail.com> on 2020/07/10 04:38:04 UTC

Does Parquet format provide indexing for quick retrieval based on column filters?

Hi,

If I want to query a Parquet file with a criterion such as income > 1000,
does Parquet support indexing of the columns to make it faster to identify
the records that match? I know we can partition the file on a
column. But in my case, assume it is already partitioned on a single column,
Date, and I want to filter the records on other criteria.

Regards,
Yash

Re: Does Parquet format provide indexing for quick retrieval based on column filters?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Yash,
There are a few mechanisms in Parquet that can help with this.  Not all of
them will be present in every Parquet file, and not all implementations
populate or make use of them (e.g. C++ lacks a few):
1.  Per-column statistics for each row group and data page [1], including
min/max values.
2.  Column indexes [2].
3.  Bloom filters [3].

Thanks,
Micah


[1]
https://github.com/apache/parquet-format/blob/232e23a68ab45be0db2cca5d0991613c9f350f8c/src/main/thrift/parquet.thrift#L197
[2] https://github.com/apache/parquet-format/blob/master/PageIndex.md
[3]
https://github.com/apache/parquet-format/blob/e1dca742bbd0e1eec3a07c70ca53535d678b20dc/BloomFilter.md
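
To illustrate how the min/max statistics in item 1 can help with a filter
such as income > 1000, here is a minimal sketch in plain Python. It is a toy
model and not the API of any Parquet library: the RowGroup class, its fields,
the scan_greater_than function, and the sample values are all hypothetical.
The idea is that a reader can skip any row group whose stored maximum rules
out a match, without decoding its values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group of a single 'income' column."""
    min_value: int      # statistic assumed to be recorded at write time
    max_value: int      # statistic assumed to be recorded at write time
    values: List[int]   # the column data, read only if the group is not skipped


def make_row_group(values: List[int]) -> RowGroup:
    # A writer would compute the statistics once, when the group is written.
    return RowGroup(min(values), max(values), values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for rg in row_groups:
        # If even the largest value is <= threshold, no row can satisfy
        # "value > threshold", so the whole group is skipped.
        if rg.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, groups_read


row_groups = [
    make_row_group([200, 450, 900]),
    make_row_group([800, 1200, 1500]),
    make_row_group([100, 300, 999]),
    make_row_group([2000, 2500, 3100]),
]

matches, groups_read = scan_greater_than(row_groups, 1000)
print(matches)      # [1200, 1500, 2000, 2500, 3100]
print(groups_read)  # 2 of the 4 row groups were read
```

Pruning like this is most effective when the data is sorted or clustered on
the filtered column, since otherwise most row groups span a wide min/max
range. Bloom filters answer membership questions, so they help equality
predicates rather than range predicates like the one above.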

On Fri, Jul 10, 2020 at 12:04 PM Yash Ganthe <ya...@gmail.com> wrote:

> Hi,
>
> If I want to query a parquet file with a criteria such as income > 1000,
> does Parquet support indexing of the columns to make it faster to identify
> the records with the criteria? I know we can partition the file on a
> column. But in my case assume it is already partitioned on a single column
> that is Date and I want to use other criteria for filtering the records.
>
> Regards,
> Yash
>