Posted to issues@drill.apache.org by "Arina Ielchiieva (JIRA)" <ji...@apache.org> on 2019/01/04 09:47:00 UTC

[jira] [Updated] (DRILL-6744) Support filter push down for varchar / decimal data types

     [ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arina Ielchiieva updated DRILL-6744:
------------------------------------
    Description: 
Since Drill now uses Apache Parquet 1.10.0, where the issue with incorrectly stored varchar / decimal min / max statistics is resolved, we should add support for varchar / decimal filter push down. Only files created with parquet lib 1.9.1 (1.10.0) and later will be subject to push down. In cases where the user knows that previously created files have correct min / max statistics (i.e. the user knows for certain that the data in binary columns is ASCII, not UTF-8), parquet.strings.signed-min-max.enabled can be set to true to enable filter push down.

*Description*

_Note: Drill is using Parquet 1.10.0 library since 1.13.0 version._

*Varchar Partition Pruning*
Varchar pruning will work for files generated both before and after Parquet 1.10.0, since partition pruning requires the min and max values to be the same, and there are no issues with incorrectly stored statistics for binary data when min and max are equal. Partition pruning using Drill metadata files will also work, no matter when the metadata file was created (before or after Drill 1.15.0).

Partition pruning won't work for files where the partition is null due to PARQUET-1341; the issue will be fixed in Parquet 1.11.0.
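The safety argument above can be sketched in a few lines. This is an illustrative example, not Drill's actual code: a file qualifies for partition pruning only when the column's min and max statistics are equal (every row holds the same value), and identical min / max values cannot be reordered by the signed-vs-unsigned comparison bug, so the check is safe even for files written before Parquet 1.10.0.

```python
def can_prune(file_min: bytes, file_max: bytes, filter_value: bytes) -> bool:
    """Return True if the whole file can be skipped for `col = filter_value`.

    Hypothetical helper illustrating the rule described in the text.
    """
    if file_min != file_max:
        # Not a single-valued (partition-like) column chunk; with unequal
        # min/max the pre-1.10.0 binary statistics cannot be trusted.
        return False
    # The file holds one constant value; prune it when that value
    # cannot match the equality filter.
    return file_min != filter_value


# A file where every value is b"2018" can be pruned for WHERE col = '2019':
print(can_prune(b"2018", b"2018", b"2019"))  # True
# But not when the constant value actually matches the filter:
print(can_prune(b"2019", b"2019", b"2019"))  # False
```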

*Varchar Filter Push Down*

Varchar filter push down will work for parquet files created with Parquet 1.10.0 and later.
There are two options to enable push down for files generated with prior Parquet versions, when the user knows for certain that the binary data is ASCII (not UTF-8):
1. Set the configuration {{enableStringsSignedMinMax}} to true (false by default) for the parquet format plugin:
{noformat}
        "parquet" : {
          type: "parquet",
          enableStringsSignedMinMax: true 
        }
{noformat}

This applies to all parquet files of the given format plugin, including all workspaces.

2. To enable / disable reading binary statistics for old parquet files per session, the session option {{store.parquet.reader.strings_signed_min_max}} can be used. By default it has an empty string value. Setting this option takes priority over the config in the parquet format plugin. The option allows three values: 'true', 'false', '' (empty string).

_Note: store.parquet.reader.strings_signed_min_max can also be set at the system level, in which case it applies to all parquet files in the system._
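The precedence rule between the two settings can be sketched as follows. This is an illustrative sketch, not Drill's implementation; the function name is made up, only the option values ('true', 'false', '') and the plugin config flag come from the text above.

```python
def strings_signed_min_max_enabled(session_option: str,
                                   plugin_config: bool) -> bool:
    """Resolve whether binary min/max statistics may be trusted.

    session_option: value of store.parquet.reader.strings_signed_min_max,
                    one of 'true', 'false', or '' (the default).
    plugin_config:  the enableStringsSignedMinMax flag of the parquet
                    format plugin (false by default).
    """
    if session_option == '':
        # Empty string means the option was not set: fall back to the
        # format plugin config.
        return plugin_config
    # An explicitly set session/system option wins either way.
    return session_option == 'true'


print(strings_signed_min_max_enabled('', True))       # True  (plugin config)
print(strings_signed_min_max_enabled('false', True))  # False (option wins)
```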

The same config / session option also controls reading binary statistics from Drill metadata files generated prior to Drill 1.15.0. If the Drill metadata file was created prior to Drill 1.15.0 but for parquet files created with Parquet library 1.10.0 and later, the user has to enable the config / session option or regenerate the Drill metadata file with Drill 1.15.0 or later, because the metadata file alone does not tell whether the statistics is stored correctly (previously Drill was reading and writing binary statistics by default, though it did not use them).

When creating a Drill metadata file with Drill 1.15.0 or later for old parquet files, the user should mind the config / session option. If strings_signed_min_max is enabled, Drill will store binary statistics in the Drill metadata file, but since the metadata file was created with Drill 1.15.0 or later, Drill will read it back disregarding the option (assuming that if statistics is present in the Drill metadata file, it is correct). If the user mistakenly enabled strings_signed_min_max, they need to disable it and regenerate the Drill metadata file. The same applies in the opposite direction: if the user created the metadata file while strings_signed_min_max was disabled, no min / max values for binary statistics will be written, and thus none will be read back, even if strings_signed_min_max is enabled when reading the metadata.
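The two paragraphs above can be condensed into a small decision sketch (hypothetical helper names, not Drill's code): binary statistics present in a Drill 1.15.0+ metadata file are trusted unconditionally, because the option was already consulted when the file was written; for older metadata files the config / session option decides at read time.

```python
def trust_binary_stats(metadata_drill_version: tuple,
                       strings_signed_min_max: bool) -> bool:
    """Decide whether binary min/max stats in a Drill metadata file
    may be used, per the rules described in the text.

    metadata_drill_version: Drill version that wrote the metadata file,
                            e.g. (1, 14, 0).
    strings_signed_min_max: effective value of the config/session option
                            at read time.
    """
    if metadata_drill_version >= (1, 15, 0):
        # A 1.15.0+ writer only stores binary stats when the option was
        # enabled at creation time, so whatever is present is trusted.
        return True
    # Older metadata files may contain bogus binary stats; only use them
    # when the user explicitly vouches for them via the option.
    return strings_signed_min_max


print(trust_binary_stats((1, 15, 0), False))  # True
print(trust_binary_stats((1, 14, 0), False))  # False
```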

*Decimal Partition Pruning*

Decimal values can be represented with four logical types: int_32, int_64, fixed_len_byte_array and binary.
Partition pruning will work for all logical types for old and new decimal files, i.e. files created both before and with Parquet 1.10.0 and later. Partition pruning won't work for files with a null partition due to PARQUET-1341, which will be fixed in Parquet 1.11.0.

Partition pruning with a Drill metadata file will work for old and new decimal files, regardless of which Drill version created the metadata file.

*Decimal Filter Push Down*

For int_32 / int_64 decimals push down will work only for new files (i.e. generated by Parquet 1.10.0 and later); for old files push down won't work due to PARQUET-1322.

For old int_32 / int_64 decimals push down will work with an old Drill metadata file, i.e. one created prior to Drill 1.14.0; for a Drill metadata file generated after Drill 1.14.0 push down won't work, since it was generated after the upgrade to Parquet 1.10.0 (due to PARQUET-1322). For new int_32 / int_64 decimals, push down will work with both old and new Drill metadata files.

For old fixed_len_byte_array / binary decimal files generated prior to Parquet 1.10.0, filter push down won't work. Push down with an old Drill metadata file works only if the strings_signed_min_max config / session option is set to true. Push down with a new Drill metadata file won't work.

For new fixed_len_byte_array / binary files filter push down will work with and without a metadata file (provided the Drill metadata file was generated by Drill 1.15.0 or later). If the Drill metadata file was generated prior to Drill 1.15.0, to enable reading such statistics the user needs to enable the strings_signed_min_max config / session option or re-generate the Drill metadata file.
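The per-logical-type rules for plain parquet files (without a Drill metadata file) can be summarized in one hypothetical helper. This is an illustrative summary of the text above, not Drill's API; "new" means written with Parquet 1.10.0 or later.

```python
def decimal_pushdown_supported(logical_type: str, new_file: bool) -> bool:
    """Whether decimal filter push down applies to a plain parquet file,
    per the rules described above (illustrative, not Drill's code)."""
    if logical_type in ('int_32', 'int_64'):
        # Old files carry broken int statistics (PARQUET-1322).
        return new_file
    if logical_type in ('fixed_len_byte_array', 'binary'):
        # Old files lack trustworthy binary min/max statistics.
        return new_file
    return False


print(decimal_pushdown_supported('int_32', True))   # True
print(decimal_pushdown_supported('binary', False))  # False
```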

*Hive Varchar Filter Push Down using Drill native reader*

Hive 2.3 parquet files are generated with a Parquet library prior to version 1.10.0, where statistics for binary UTF-8 data can be stored incorrectly. If the user knows for certain that the data in the binary columns is ASCII (not UTF-8), the session option store.parquet.reader.strings_signed_min_max can be set to 'true' to enable varchar filter push down.

*Hive Decimal Filter Push Down using Drill native reader*

Hive 2.3 parquet files are generated with a Parquet library prior to version 1.10.0; decimal statistics for such files are not available, thus push down won't work with Hive parquet decimal files.


> Support filter push down for varchar / decimal data types
> ---------------------------------------------------------
>
>                 Key: DRILL-6744
>                 URL: https://issues.apache.org/jira/browse/DRILL-6744
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.14.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: doc-complete, ready-to-commit
>             Fix For: 1.15.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)