You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@drill.apache.org by br...@apache.org on 2018/12/14 21:01:10 UTC

[drill] branch gh-pages updated: DRILL-6744

This is an automated email from the ASF dual-hosted git repository.

bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git


The following commit(s) were added to refs/heads/gh-pages by this push:
     new f520885  DRILL-6744
f520885 is described below

commit f5208854c534ecba1490aea23f615b072011fea1
Author: Bridget Bevens <bb...@maprtech.com>
AuthorDate: Fri Dec 14 13:00:12 2018 -0800

    DRILL-6744
---
 .../026-parquet-filter-pushdown.md                 | 51 ++++++++++++++++++++--
 1 file changed, 47 insertions(+), 4 deletions(-)

diff --git a/_docs/performance-tuning/026-parquet-filter-pushdown.md b/_docs/performance-tuning/026-parquet-filter-pushdown.md
index beebe78..a1c5416 100644
--- a/_docs/performance-tuning/026-parquet-filter-pushdown.md
+++ b/_docs/performance-tuning/026-parquet-filter-pushdown.md
@@ -1,6 +1,6 @@
 ---
 title: "Parquet Filter Pushdown"
-date: 2018-09-28 21:35:21 UTC
+date: 2018-12-14
 parent: "Performance Tuning"
 ---
 
@@ -23,10 +23,53 @@ Parquet filter pushdown is similar to partition pruning in that it reduces the a
  
 The query planner looks at the minimum and maximum values in each row group for an intersection. If no intersection exists, the planner can prune the row group in the table. If the minimum and maximum value range is too large, Drill does not apply Parquet filter pushdown. The query planner can typically prune more data when the tables in the Parquet file are sorted by row groups.  
 
+##Parquet Filter Pushdown for VARCHAR and DECIMAL Data Types  
+Starting in Drill 1.15, Drill supports Parquet filter pushdown for the VARCHAR and DECIMAL data types. Drill uses binary statistics in the Parquet file or Drill metadata file to push filters on VARCHAR and DECIMAL data types down to the data source.  
+
+###Parquet Generated Files  
+By default, Parquet filter pushdown works for VARCHAR and DECIMAL data types if the Parquet files were created with Parquet version 1.10.0 or later. Drill 1.13 and later uses Parquet 1.10.0 to write and read back Parquet files. 
+
+If Parquet files were created with a pre-1.10.0 version of Parquet, and the data in the binary columns is in ASCII format (not UTC-8), enable the `store.parquet.reader.strings_signed_min_max` option, which allows Drill to use binary statistics in older Parquet files.  
+
+**Note:** DECIMAL filter pushdown only works for Parquet files created by Parquet 1.10.0 or later due to issue [PARQUET-1322](https://issues.apache.org/jira/browse/PARQUET-1322).  
+
+###Parquet Files Created by Hive
+In Hive 2.3, Parquet files are created by a pre-1.10.0 version of Parquet. If the data in the binary columns is in ASCII format, you can enable the `store.parquet.reader.strings_signed_min_max` option to enable pushdown support for VARCHAR data types. DECIMAL filter pushdown is not supported.  
+
+###Drill Generated Metadata Files  
+Parquet filter pushdown for DECIMAL and VARCHAR data types may not work correctly on Drill metadata files that were generated prior to Drill 1.15. Regenerate all Drill metadata files using Drill 1.15 or later to ensure that Parquet filter pushdown works correctly on Drill generated metadata files.
+
+If the `store.parquet.reader.strings_signed_min_max` option is not enabled during regeneration, the minimum and maximum values for the binary data will not be written. When the binary data is in ASCII format, enabling the `store.parquet.reader.strings_signed_min_max` option during regeneration ensures that the minimum and maximum values are written and thus read back and used during filter pushdown.  
+
+###Enabling Statistics Use for Pre-1.10.0 Parquet Files
+If Parquet files were created with a pre-1.10.0 version of Parquet, and the data in binary columns is in ASCII format (not UTF-8), you can enable Drill to use the statistics for Parquet filter pushdown on VARCHAR and DECIMAL data types.
+
+You can use either of the following methods to enable this functionality in Drill:  
+
+- In the `parquet` format plugin configuration, add the `enableStringsSignedMinMax` option, and set the option to `true`, as shown:  
+  
+		"parquet" : {
+      	       type: "parquet",
+     	       enableStringsSignedMinMax: true
+   	    	}  
+
+	This configuration applies to all Parquet files in the `parquet` format plugin to which this storage plugin points, including the configured workspaces.
+
+  
+- From the command line, enable the `store.parquet.reader.strings_signed_min_max` option at the session or system level, as shown:  
+
+		SET `store.parquet.reader.strings_signed_min_max`='true';
+		ALTER SYSTEM SET `store.parquet.reader.strings_signed_min_max`='true';  
+**Note:**  
+	- The `store.parquet.reader.strings_signed_min_max` option allows three values: `'true'`, `'false'`, `''`(empty string). By default, the value is an empty string.  
+	- Setting this option at the system level applies to all Parquet files in the system. Alternatively, you can set this option in the Drill Web UI. Options in the Drill Web UI are set at the system level.  
+	- When set at the session level, the setting takes precedence over the setting in the parquet format plugin and overrides the system level setting.  
+  
+
 ##Using Parquet Filter Pushdown
 Currently, Parquet filter pushdown only supports filters that reference columns from a single table (local filters). Parquet filter pushdown requires the minimum and maximum values in the Parquet file metadata. All Parquet files created in Drill using the CTAS statement contain the necessary metadata. If your Parquet files were created using another tool, you may need to use Drill to read and rewrite the files using the [CTAS command]({{site.baseurl}}/docs/create-table-as-ctas/).
  
-Parquet filter pushdown works best if you presort the data. You do not have to sort the entire data set at once. You can sort a subset of the data set, sort another subset, and so on. 
+Parquet filter pushdown works best if you presort the data. You do not have to sort the entire data set at once. You can sort a subset of the data set, sort another subset, and so on.   
 
 ###Configuring Parquet Filter Pushdown  
 Use the [ALTER SYSTEM|SESSION SET]({{site.baseurl}}/docs/alter-system/) command with the Parquet filter pushdown options to enable or disable the feature, and set the number of row groups for a table.  
@@ -63,14 +106,14 @@ This query performs a join on two tables partitioned by the “month” column.
 ##Support 
 The following table lists the supported and unsupported clauses, operators, data types, function, and scenarios for Parquet filter pushdown:  
 
-**Note:** <sup>1</sup> indicates support as of Drill 1.13. <sup>2</sup> indicates support as of Drill 1.14.  
+**Note:** <sup>1</sup> indicates support as of Drill 1.13. <sup>2</sup> indicates support as of Drill 1.14. <sup>3</sup> indicates support as of Drill 1.15.  
 
 |                      | Supported                                                                                                                                                                                                                                                     | Not Supported                           |
 |----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|
 | Clauses              | WHERE,   <sup>1</sup>WITH, HAVING (HAVING is supported if Drill can pass the filter through GROUP   BY.)                                                                                                                                                                 | -                                       |
 | Operators            | <sup>2</sup>BETWEEN,   <sup>2</sup>ITEM, AND, OR, NOT, <sup>1</sup>IS [NOT] NULL, <sup>1</sup>IS [NOT] TRUE, <sup>1</sup>IS [NOT] FALSE, IN (An   IN list is converted to OR if the number in the IN list is within a certain   threshold, for example 20. If greater than the threshold, pruning cannot   occur.) | -                                       |
 | Comparison Operators | <>,   <, >, <=, >=, =                                                                                                                                                                                                                                         | -                                       |
-| Data Types           | INT,   BIGINT, FLOAT, DOUBLE, DATE, TIMESTAMP, TIME, <sup>1</sup>BOOLEAN (true, false)                                                                                                                                                                                   | CHAR,   VARCHAR columns, Hive TIMESTAMP |
+| Data Types           | INT,   BIGINT, FLOAT, DOUBLE, DATE, TIMESTAMP, TIME, <sup>1</sup>BOOLEAN (true, false), <sup>3</sup>VARCHAR columns                                                                                                                                                                                   | CHAR,   Hive TIMESTAMP |
 | Function             | CAST   is supported among the following types only: int, bigint, float, double,   <sup>1</sup>date, <sup>1</sup>timestamp, and <sup>1</sup>time                                                                                                                                                | -                                       |
 | Other                | <sup>2</sup>Enabled   native Hive reader, Files with multiple row groups, <sup>2</sup>Joins                                                                                                                                                                                       | -                                       |