Posted to dev@impala.apache.org by Wang Chunling <wa...@sina.com> on 2017/07/19 10:22:50 UTC

What is dictionary filter in Impala?

Hi,

I see that Impala has a dictionary filter that it applies during Parquet scans. The code comment says that a column that is 100% dictionary encoded can be dictionary filtered. Can you explain what kinds of columns can be dictionary encoded? And is there an example of dictionary filtering? Thanks a lot.


Chunling

Re: What is dictionary filter in Impala?

Posted by Tim Armstrong <ta...@cloudera.com>.
Hi,
  The Parquet format supports various encodings that help compress columns
of data with different characteristics. Dictionary encoding is useful if
there are many repeats of the same value in the same column. E.g. if you
have a string column with country names, you might have "Australia",
"USA", "China" repeated many times. If there are <= 40,000 distinct values,
a column can be encoded with a dictionary: at the start of the column there
is a dictionary with all of the distinct values, and each value in the data
is then stored as an integer index into that dictionary.

 E.g. if the dictionary was ["Australia", "USA", "China"], then "China"
would be encoded as 2 (indexes start at 0).
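
To make the encoding concrete, here is a minimal Python sketch of the idea. It is only an illustration, not Impala's or Parquet's actual implementation: the function name dictionary_encode is made up for this example, and the real format also bit-packs and run-length encodes the integer indexes.

```python
def dictionary_encode(values):
    """Return (dictionary, codes) for a column of values.

    dictionary holds each distinct value once, in first-seen order.
    codes holds one integer index into dictionary per input value.
    Hypothetical helper for illustration only.
    """
    dictionary = []
    index_of = {}  # value -> position in dictionary
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


column = ["Australia", "USA", "China", "USA", "China", "Australia"]
dictionary, codes = dictionary_encode(column)

# Decoding is a lookup of each code in the dictionary.
decoded = [dictionary[code] for code in codes]
```

With this input the dictionary comes out as ["Australia", "USA", "China"], so every "China" in the column is stored as the integer 2.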

Dictionary filtering takes advantage of this to speed up scans. E.g. if I
have a query like "select * from my_table where country = 'Iceland'", then
we can check the dictionary for a Parquet row group before scanning the row
group. If no entries in the dictionary match the condition, then we can
skip the whole row group.
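
Roughly, the check looks like the Python sketch below. This is a simplified model and not Impala's scanner code: scan_row_groups and the (dictionary, codes) representation of a row group are assumptions made for the example, and it assumes every value in the column is dictionary encoded (the "100% dictionary encoded" condition), since otherwise the dictionary would not list every value in the row group.

```python
def scan_row_groups(row_groups, predicate):
    """Return (matching_values, skipped_count).

    row_groups is a list of (dictionary, codes) pairs, one per row group.
    Hypothetical model for illustration; assumes each column chunk is
    fully dictionary encoded.
    """
    matches = []
    skipped = 0
    for dictionary, codes in row_groups:
        # Evaluate the predicate once per distinct value, not once per row.
        matching_codes = {
            i for i, value in enumerate(dictionary) if predicate(value)
        }
        if not matching_codes:
            # No dictionary entry can match, so no row can match either.
            skipped += 1
            continue
        matches.extend(dictionary[c] for c in codes if c in matching_codes)
    return matches, skipped


row_groups = [
    (["Australia", "USA", "China"], [0, 1, 2, 1, 2, 0]),
    (["Iceland", "USA"], [1, 0, 1, 0]),
]
rows, skipped = scan_row_groups(
    row_groups, lambda country: country == "Iceland"
)
```

Here the first row group is skipped without looking at any of its rows, because "Iceland" is not in its dictionary. Only the second row group is actually scanned.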
