You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by Lars Volker <lv...@cloudera.com> on 2017/11/01 22:16:09 UTC
Re: Parquet min/max statistics & null values

Hi Bruno,

Parquet-mr currently doesn't support the new stats fields:
https://issues.apache.org/jira/browse/PARQUET-1025

For development purposes I used a small Python script a while ago. I pushed
it here so you can check it out if you'd like, but it's really only meant
for development purposes:
https://github.com/lekv/impala/tree/dump_parquet_stats

Cheers, Lars


On Sat, Oct 28, 2017 at 1:19 PM, Bruno Quinart <bq...@icloud.com> wrote:

> Hi
>
> Thanks for your replies.
>
> An example query:
> SELECT * FROM reasonably_sized_partitioned_table
> WHERE partition_key_dtnr = 20171028
> AND sorted_column = 56789123456
> (same effect with an IN instead of the last equality)
>
> I checked the behavior in our cluster again and can now confirm Lars
> explanation:
> - Filtering on stats works well as from 2.9, except when there is only
> null values for the column in a parquet file
> - And that would be addressed by IMPALA-6113
> I was wrongly assuming that in some of my tables there would not have been
> enough null values to fill a parquet file for each partition.
>
> I wonder how you verify those stats in Parquet metadata. Parquet-tools
> currently does not print them.
>
> Thanks for the help
> Bruno
>
> On 27 Oct, 2017, at 12:54 AM, Lars Volker wrote:
>
> Hi Bruno,
>
> To clarify on your original observation: Column chunks that only have null
> values in them will not have their min_value and max_value fields populated
> and thus won't be skipped based on stats. I filed IMPALA-6113
> <https://issues.apache.org/jira/browse/IMPALA-6113> to track this.
> IMPALA-5061 <https://issues.apache.org/jira/browse/IMPALA-5061> added
> support to populate the null_count in statistics, allowing us to detect
> column chunks that only contain NULLs. We should use that information to
> skip row groups if the predicate allows us to.
>
> Row groups with column chunks that have at least one non-null value should
> get filtered correctly.
>
> Cheers, Lars
>
> On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong wrote:
>
> Hi Bruno,
>
> Could you provide an example of the specific predicates that aren't being
>
> used to successfully skip the row group?
>
>
> - Tim
>
>
> On Thu, Oct 26, 2017 at 7:21 AM, Jeszy wrote:
>
>
> Hello Bruno,
>
>
> Thanks for bringing this up. While not apparent from the commit
>
> comments, this limitation was mentioned during the code review:
>
> 'min/max are only set when there are non-null values, so we don't
>
> consider statistics for "is null".' (see
>
> https://gerrit.cloudera.org/#/c/6147/).
>
> It looks to me that this was intended, but I'll let others confirm.
>
> Definitely a point where we can improve.
>
>
> Thanks!
>
>
> On 26 October 2017 at 08:02, Bruno Quinart wrote:
>
> > Hi all
>
> >
>
> > With IMPALA-2328, Parquet row group statistics are now being used to
>
> skip
>
> > the row group completely if the min/max range is excluded from the
>
> > predicate.
>
> > We have a use case in which we make sure the data is sorted on a 'key'
>
> and
>
> > have then many selective queries on that 'key' field. We notice a
>
> > significant performance increase.
>
> > So thanks a lot for all the work on that!
>
> >
>
> > One thing we notice is an unexpected behavior for records where that
>
> 'key'
>
> > has null values. It seems that as soon as null values are present in a
>
> row
>
> > group, the test on the min/max fails and the row group is read.
>
> >
>
> > We work with Impala 2.9. The data is put in parquet files by Impala
>
> itself.
>
> > We have noticed this effect for both bigint as decimal fields. Note that
>
> > it's difficult for me to extract the min/max statistics from the parquet
>
> > files. The parquet-tools included in our distribution (5.12) is not the
>
> > latest. And I was told PARQUET-327 would anyway not print the those row
>
> > group stats because of the way Impala stores them.
>
> > We do confirm the expected behavior (exactly one row group read for
>
> properly
>
> > sorted data) when we create a similar table but explicitly filter out
>
> all
>
> > null values for that 'key' field. We also notice that the the number of
>
> row
>
> > groups read (but zero records retained) is proportional to the number of
>
> > null values.
>
> >
>
> > Is this behavior expected?
>
> > Is there a fundamental reason those row groups can not be skipped?
>
> >
>
> > Thanks!
>
> > Bruno
>
> >
>
>
>
>
>