Posted to commits@drill.apache.org by br...@apache.org on 2019/04/29 20:49:01 UTC
[drill] branch gh-pages updated: edit refresh and schema docs

This is an automated email from the ASF dual-hosted git repository.

bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git


The following commit(s) were added to refs/heads/gh-pages by this push:
     new cb22336  edit refresh and schema docs
cb22336 is described below

commit cb22336129c6edf72f60747b2950da7d91f90d3d
Author: Bridget Bevens <bb...@maprtech.com>
AuthorDate: Mon Apr 29 13:48:17 2019 -0700

    edit refresh and schema docs
---
 .../sql-commands/011-refresh-table-metadata.md     | 321 +++++++++++----------
 .../sql-commands/021-create-schema.md              |   8 +-
 2 files changed, 170 insertions(+), 159 deletions(-)

diff --git a/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
index d678619..2af0d0b 100644
--- a/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
+++ b/_docs/sql-reference/sql-commands/011-refresh-table-metadata.md
@@ -1,6 +1,6 @@
 ---
 title: "REFRESH TABLE METADATA"
-date: 2019-04-23
+date: 2019-04-29
 parent: "SQL Commands"
 ---
 Run the REFRESH TABLE METADATA command on Parquet tables and directories to generate a metadata cache file. REFRESH TABLE METADATA collects metadata from the footers of Parquet files and writes the metadata to a metadata file (`.drill.parquet_file_metadata.v4`) and a summary file (`.drill.parquet_summary_metadata.v4`). The planner uses the metadata cache file to prune extraneous data during the query planning phase. Run the REFRESH TABLE METADATA command if planning time is a significant [...]
@@ -34,7 +34,7 @@ Run the [EXPLAIN]({{site.baseurl}}/docs/explain/) command to determine the query
 ## Usage Notes  
 
 ### Metadata Storage  
-- Drill traverses directories for Parquet files and gathers the metadata from the footer of the files. Drill stores the collected metadata in a metadata cache file, `.drill.parquet_metadata`, at each directory level.  
+- Drill traverses directories for Parquet files and gathers the metadata from the footer of the files. Drill stores the collected metadata in a metadata cache file, `.drill.parquet_file_metadata.v4`, a summary file, `.drill.parquet_summary_metadata.v4`, and a directories file, `.drill.parquet_metadata_directories`, at each directory level.  
 - The metadata cache file stores metadata for files in that directory, as well as the metadata for the files in the subdirectories.  
 - For each row group in a Parquet file, the metadata cache file stores the column names in the row group and the column statistics, such as the min/max values and null count.  
 - If the Parquet data is updated, for example data is added to a file, Drill automatically  refreshes the Parquet metadata when you issue the next query against the Parquet data.  
@@ -70,22 +70,31 @@ Sets the number of row groups that a table can have. You can increase the thresh
 ## Limitations
 Currently, Drill does not support runtime rowgroup pruning. 
 
-<!--
-## Examples  
-These examples use a schema, `dfs.samples`, which points to the `/home` directory. The `/home` directory contains a subdirectory, `parquet`, which
-contains the `nation.parquet` and a subdirectory, `dir1` with the `region.parquet` file. You can access the `nation.parquet` and `region.parquet` Parquet files in the `sample-data` directory of your Drill installation.  
 
-	[root@doc23 dir1]# pwd
-	/home/parquet/dir1
-	 
+## Examples  
+These examples use a schema, `dfs.samples`, which points to the `/tmp` directory. The `/tmp` directory contains the following subdirectories and files used in the examples:  
+
+	[root@doc23 parquet1]# pwd
+	/tmp/parquet1
+	
+	[root@doc23 parquet1]# ls
+	parquet
+	
+	[root@doc23 parquet1]# cd parquet
+	
 	[root@doc23 parquet]# ls
-	dir1  nation.parquet
-	 
-	[root@doc23 dir1]# ls
-	region.parquet  
+	nation.parquet  test
+	
+	[root@doc23 parquet]# cd test
+	
+	[root@doc23 test]# ls
+	nation.parquet
+
+**Note:** You can access the sample `nation.parquet` file in the `sample-data` directory of your Drill installation.
 
-Change schemas to use `dfs.samples`:
  
+Switch to the `dfs.samples` schema: 
+
 	use dfs.samples;
 	+-------+------------------------------------------+
 	|  ok   |                 summary        	      |
@@ -93,37 +102,113 @@ Change schemas to use `dfs.samples`:
 	| true  | Default schema changed to [dfs.samples]  |
 	+-------+------------------------------------------+  
 
-### Running REFRESH TABLE METADATA on a Directory  
-Running the REFRESH TABLE METADATA command on the `parquet` directory generates metadata cache files at each directory level.  
-
-	REFRESH TABLE METADATA parquet;  
-	+-------+---------------------------------------------------+
-	|  ok   |                  	summary                  	|
-	+-------+---------------------------------------------------+
-	| true  | Successfully updated metadata for table parquet.  |
-	+-------+---------------------------------------------------+  
-
-When looking at the `parquet` directory and `dir1` subdirectory, you can see that a metadata cache file was created at each level:
-
+### Running REFRESH TABLE METADATA on a Directory
+Running the REFRESH TABLE METADATA command on the `parquet1` directory generates metadata cache files at each directory level.
+
+	apache drill (dfs.samples)> REFRESH TABLE METADATA parquet1;
+	+------+---------------------------------------------------+
+	|  ok  |                      summary                      |
+	+------+---------------------------------------------------+
+	| true | Successfully updated metadata for table parquet1. |
+	+------+---------------------------------------------------+
+
+When looking at the `parquet1` directory and its subdirectories, you can see that hidden metadata cache and summary files were created at each level:
+
+**Note:** The CRC files are cyclic redundancy check (CRC) checksum files used to verify the data integrity of other files. 
+
+	[root@doc23 parquet1]# ls -la
+	total 36
+	drwxr-xr-x   3 root root  284 Apr 29 11:46 .
+	drwxrwxrwt. 51 root root 8192 Apr 29 11:44 ..
+	-rw-r--r--   1 root root 1037 Apr 29 11:46 .drill.parquet_file_metadata.v4
+	-rw-r--r--   1 root root   20 Apr 29 11:46 ..drill.parquet_file_metadata.v4.crc
+	-rw-r--r--   1 root root   51 Apr 29 11:46 .drill.parquet_metadata_directories
+	-rw-r--r--   1 root root   12 Apr 29 11:46 ..drill.parquet_metadata_directories.crc
+	-rw-r--r--   1 root root 1334 Apr 29 11:46 .drill.parquet_summary_metadata.v4
+	-rw-r--r--   1 root root   20 Apr 29 11:46 ..drill.parquet_summary_metadata.v4.crc
+	drwxr-xr-x   3 root root  212 Apr 29 11:30 parquet  
+	
+	[root@doc23 parquet1]# cd parquet
 	[root@doc23 parquet]# ls -la
-	drwxr-xr-x   2 root root   95 Mar 18 17:49 dir1
-	-rw-r--r--   1 root root 2642 Mar 18 17:52 .drill.parquet_metadata
-	-rw-r--r--   1 root root   32 Mar 18 17:52 ..drill.parquet_metadata.crc
-	-rwxr-xr-x   1 root root 1210 Mar 13 13:32 nation.parquet
-	 
-	[root@doc23 dir1]# ls -la
-	-rw-r--r-- 1 root root 1235 Mar 18 17:52 .drill.parquet_metadata
-	-rw-r--r-- 1 root root   20 Mar 18 17:52 ..drill.parquet_metadata.crc
-	-rwxr-xr-x 1 root root  455 Mar 18 17:41 region.parquet  
-
-The following sections compare the content of the metadata cache file in  the `parquet` and `dir1` directories:  
+	total 20
+	drwxr-xr-x 3 root root  212 Apr 29 11:30 .
+	drwxr-xr-x 3 root root  284 Apr 29 11:46 ..
+	-rw-r--r-- 1 root root 1021 Apr 29 11:46 .drill.parquet_file_metadata.v4
+	-rw-r--r-- 1 root root   16 Apr 29 11:46 ..drill.parquet_file_metadata.v4.crc
+	-rw-r--r-- 1 root root 1315 Apr 29 11:46 .drill.parquet_summary_metadata.v4
+	-rw-r--r-- 1 root root   20 Apr 29 11:46 ..drill.parquet_summary_metadata.v4.crc
+	-rwxr-xr-x 1 root root 1210 Apr 29 11:23 nation.parquet
+	drwxr-xr-x 2 root root  200 Apr 29 11:46 test
+	
+	[root@doc23 test]# ls -la
+	total 20
+	drwxr-xr-x 2 root root  200 Apr 29 11:46 .
+	drwxr-xr-x 3 root root  212 Apr 29 11:30 ..
+	-rw-r--r-- 1 root root  517 Apr 29 11:46 .drill.parquet_file_metadata.v4
+	-rw-r--r-- 1 root root   16 Apr 29 11:46 ..drill.parquet_file_metadata.v4.crc
+	-rw-r--r-- 1 root root 1308 Apr 29 11:46 .drill.parquet_summary_metadata.v4
+	-rw-r--r-- 1 root root   20 Apr 29 11:46 ..drill.parquet_summary_metadata.v4.crc
+	-rwxr-xr-x 1 root root 1210 Apr 29 11:23 nation.parquet  
+
+Looking at the `.drill.parquet_file_metadata.v4` file in the `/tmp/parquet1` directory, you can see that the file contains the paths to the Parquet files in the subdirectories, as well as metadata for those files: 
+
+	[root@doc23 parquet1]# cat .drill.parquet_file_metadata.v4
+	{
+	  "files" : [ {
+	    "path" : "parquet/test/nation.parquet",
+	    "length" : 1210,
+	    "rowGroups" : [ {
+	      "start" : 4,
+	      "length" : 944,
+	      "rowCount" : 25,
+	      "hostAffinity" : {
+	        "localhost" : 1.0
+	      },
+	      "columns" : [ {
+	        "name" : [ "N_NATIONKEY" ],
+	        "nulls" : -1
+	      }, {
+	        "name" : [ "N_NAME" ],
+	        "nulls" : -1
+	      }, {
+	        "name" : [ "N_REGIONKEY" ],
+	        "nulls" : -1
+	      }, {
+	        "name" : [ "N_COMMENT" ],
+	        "nulls" : -1
+	      } ]
+	    } ]
+	  }, {
+	    "path" : "parquet/nation.parquet",
+	    "length" : 1210,
+	    "rowGroups" : [ {
+	      "start" : 4,
+	      "length" : 944,
+	      "rowCount" : 25,
+	      "hostAffinity" : {
+	        "localhost" : 1.0
+	      },
+	      "columns" : [ {
+	        "name" : [ "N_NATIONKEY" ],
+	        "nulls" : -1
+	      }, {
+	        "name" : [ "N_NAME" ],
+	        "nulls" : -1
+	      }, {
+	        "name" : [ "N_REGIONKEY" ],
+	        "nulls" : -1
+	      }, {
+	        "name" : [ "N_COMMENT" ],
+	        "nulls" : -1
+	      } ]
+	    } ]
+	  } ]
 
-**Content of the metadata cache file in the directory named `parquet` that contains the nation.parquet file and subdirectory `dir1`.**  
 
+Looking at the `.drill.parquet_summary_metadata.v4` file in the `parquet1` directory, you can see information about each column in the files, the list of subdirectories, and which columns are marked as interesting (useful when specifying columns in the REFRESH TABLE METADATA command):  
 
-	[root@doc23 parquet]# cat .drill.parquet_metadata
+	[root@doc23 parquet1]# cat .drill.parquet_summary_metadata.v4
 	{
-	  "metadata_version" : "3.3",
 	  "columnTypeInfo" : {
 	    "`N_COMMENT`" : {
 	      "name" : [ "N_COMMENT" ],
@@ -132,7 +217,9 @@ The following sections compare the content of the metadata cache file in  the `p
 	      "precision" : 0,
 	      "scale" : 0,
 	      "repetitionLevel" : 0,
-	      "definitionLevel" : 0
+	      "definitionLevel" : 0,
+	      "totalNullCount" : -1,
+	      "isInteresting" : true
 	    },
 	    "`N_NATIONKEY`" : {
 	      "name" : [ "N_NATIONKEY" ],
@@ -141,25 +228,9 @@ The following sections compare the content of the metadata cache file in  the `p
 	      "precision" : 0,
 	      "scale" : 0,
 	      "repetitionLevel" : 0,
-	      "definitionLevel" : 0
-	    },
-	    "`R_REGIONKEY`" : {
-	      "name" : [ "R_REGIONKEY" ],
-	      "primitiveType" : "INT64",
-	      "originalType" : null,
-	      "precision" : 0,
-	      "scale" : 0,
-	      "repetitionLevel" : 0,
-	      "definitionLevel" : 0
-	    },
-	    "`R_COMMENT`" : {
-	      "name" : [ "R_COMMENT" ],
-	      "primitiveType" : "BINARY",
-	      "originalType" : "UTF8",
-	      "precision" : 0,
-	      "scale" : 0,
-	      "repetitionLevel" : 0,
-	      "definitionLevel" : 0
+	      "definitionLevel" : 0,
+	      "totalNullCount" : -1,
+	      "isInteresting" : true
 	    },
 	    "`N_REGIONKEY`" : {
 	      "name" : [ "N_REGIONKEY" ],
@@ -168,16 +239,9 @@ The following sections compare the content of the metadata cache file in  the `p
 	      "precision" : 0,
 	      "scale" : 0,
 	      "repetitionLevel" : 0,
-	      "definitionLevel" : 0
-	    },
-	    "`R_NAME`" : {
-	      "name" : [ "R_NAME" ],
-	      "primitiveType" : "BINARY",
-	      "originalType" : "UTF8",
-	      "precision" : 0,
-	      "scale" : 0,
-	      "repetitionLevel" : 0,
-	      "definitionLevel" : 0
+	      "definitionLevel" : 0,
+	      "totalNullCount" : -1,
+	      "isInteresting" : true
 	    },
 	    "`N_NAME`" : {
 	      "name" : [ "N_NAME" ],
@@ -186,97 +250,44 @@ The following sections compare the content of the metadata cache file in  the `p
 	      "precision" : 0,
 	      "scale" : 0,
 	      "repetitionLevel" : 0,
-	      "definitionLevel" : 0
+	      "definitionLevel" : 0,
+	      "totalNullCount" : -1,
+	      "isInteresting" : true
 	    }
 	  },
-	  "files" : [ {
-	    "path" : "dir1/region.parquet",
-	    "length" : 455,
-	    "rowGroups" : [ {
-	      "start" : 4,
-	      "length" : 250,
-	      "rowCount" : 5,
-	      "hostAffinity" : {
-	        "localhost" : 1.0
-	      },
-	      "columns" : [ ]
-	    } ]
-	  }, {
-	    "path" : "nation.parquet",
-	    "length" : 1210,
-	    "rowGroups" : [ {
-	      "start" : 4,
-	      "length" : 944,
-	      "rowCount" : 25,
-	      "hostAffinity" : {
-	        "localhost" : 1.0
-	      },
-	      "columns" : [ ]
-	    } ]
-	  } ],
-	  "directories" : [ "dir1" ],
-	  "drillVersion" : "1.16.0-SNAPSHOT"  
+	  "directories" : [ "parquet/test", "parquet" ],
+	  "drillVersion" : "1.16.0-SNAPSHOT",
+	  "totalRowCount" : 50,
+	  "allColumnsInteresting" : true,
+	  "metadata_version" : "4"  
 
-**Content of the directory named `dir1` that contains the `region.parquet` file and no subdirectories.**  
-
-	[root@doc23 dir1]# cat .drill.parquet_metadata
-	{
-	  "metadata_version" : "3.3",
-	  "columnTypeInfo" : {
-	   	"`R_REGIONKEY`" : {
-	   	"name" : [ "R_REGIONKEY" ],
-	   	"primitiveType" : "INT64",
-	   	"originalType" : null,
-	   	"precision" : 0,
-	   	"scale" : 0,
-	   	"repetitionLevel" : 0,
-	   	"definitionLevel" : 0
-	   	},
-	   	"`R_COMMENT`" : {
-	   	"name" : [ "R_COMMENT" ],
-	   	"primitiveType" : "BINARY",
-	   	"originalType" : "UTF8",
-	   	"precision" : 0,
-	   	"scale" : 0,
-	   	"repetitionLevel" : 0,
-	   	"definitionLevel" : 0
-	   	},
-	   	"`R_NAME`" : {
-	   	"name" : [ "R_NAME" ],
-	   	"primitiveType" : "BINARY",
-	      "originalType" : "UTF8",
-	   	"precision" : 0,
-	   	"scale" : 0,
-	   	"repetitionLevel" : 0,
-	   	"definitionLevel" : 0
-	   	}
-	  },
-	  "files" : [ {
-	   	"path" : "region.parquet",
-	   	"length" : 455,
-	   	"rowGroups" : [ {
-	   	"start" : 4,
-	   	"length" : 250,
-	   	"rowCount" : 5,
-	   	"hostAffinity" : {
-	   	"localhost" : 1.0
-	   	},
-	   	"columns" : [ ]
-	   	} ]
-	  } ],
-	  "directories" : [ ],
-	  "drillVersion" : "1.16.0-SNAPSHOT"
-	}  
-
-### Verifying that the Planner is Using the Metadata Cache File 
+### Verifying that the Planner is Using the Metadata Cache or Summary Files
 
 When the planner uses metadata cache files, the query plan includes the `usedMetadataFile` flag. You can access the query plan in the Drill Web UI, by clicking on the query in the Profiles page, or by running the EXPLAIN PLAN FOR command, as shown:
 
-	EXPLAIN PLAN FOR SELECT * FROM parquet;  
- 
+	apache drill (dfs.samples)> explain plan for select * from parquet1;
+	+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+	|                                       text                                       |                                       json                                       |
+	+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
 	| 00-00    Screen
 	00-01      Project(**=[$0])
-	00-02      Scan(table=[[dfs, samples, parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/home/parquet]], selectionRoot=/home/parquet, numFiles=1, numRowGroups=2, usedMetadataFile=true, cacheFileRoot=/home/parquet, columns=[`**`]]])
-	|... 
+	00-02        Scan(table=[[dfs, samples, parquet1]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/parquet1]], selectionRoot=/tmp/parquet1, numFiles=1, numRowGroups=2, usedMetadataFile=true, cacheFileRoot=/tmp/parquet1, columns=[`**`]]])  
+	 |   
+
+When you run the EXPLAIN command with a COUNT(*) query, as shown, you can see that the query planner uses the summary file and avoids reading the larger metadata cache file. The query plan includes the `usedMetadataSummaryFile` flag.
+
+	apache drill (dfs.samples)> explain plan for select count(*) from parquet1;
+	+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+	|                                       text                                       |                                       json                                       |
+	+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+	| 00-00    Screen
+	00-01      Project(EXPR$0=[$0])
+	00-02        DirectScan(groupscan=[files = [file:/tmp/parquet1/.drill.parquet_summary_metadata.v4], numFiles = 1, usedMetadataSummaryFile = true, DynamicPojoRecordReader{records = [[50]]}])
+	 | 
+
+	
+	
+
+
+
 
--->	
diff --git a/_docs/sql-reference/sql-commands/021-create-schema.md b/_docs/sql-reference/sql-commands/021-create-schema.md
index f236735..21ee180 100644
--- a/_docs/sql-reference/sql-commands/021-create-schema.md
+++ b/_docs/sql-reference/sql-commands/021-create-schema.md
@@ -1,10 +1,10 @@
 ---
 title: "CREATE OR REPLACE SCHEMA"
-date: 2019-04-25
+date: 2019-04-29
 parent: "SQL Commands"
 ---
 
-Starting in Drill 1.16, you can define a schema for text files using the CREATE OR REPLACE SCHEMA command. Running this command generates a hidden .drill.schema file in the table’s root directory. The .drill.schema file stores the schema definition in JSON format. Drill uses the schema file at runtime if the exec.storage.enable_v3_text_reader and store.table.use_schema_file options are enabled. Alternatively, you can create the schema file manually. When created manually, the file conten [...]
+Starting in Drill 1.16, you can define a schema for text files using the CREATE OR REPLACE SCHEMA command. Running this command generates a hidden `.drill.schema` file in the table’s root directory. The `.drill.schema` file stores the schema definition in JSON format. Drill uses the schema file at runtime if the `exec.storage.enable_v3_text_reader` and `store.table.use_schema_file` options are enabled. Alternatively, you can create the schema file manually. If created manually, the file  [...]
 
 ##Syntax
 
@@ -187,7 +187,7 @@ Values are trimmed when converting to any type, except for varchar.
 ### Schema Mode (Column Order)
 The schema mode determines the ordering of columns returned for wildcard (*) queries. The mode is set through the `drill.strict` property. You can set this property to true (strict) or false (not strict). If you do not indicate the mode, the default is false (not strict).  
 
-**Not Strict (Default)**
+**Not Strict (Default)**  
 Columns defined in the schema are projected in the defined order. Columns not defined in the schema are appended to the defined columns, as shown:  
 
 	create or replace schema (id int, start_date date format 'yyyy-MM-dd') for table dfs.tmp.`text_table` properties ('drill.strict' = 'false');
@@ -210,7 +210,7 @@ Columns defined in the schema are projected in the defined order. Columns not de
  
 Note that the “name” column, which was not included in the schema was appended to the end of the table.
 
-**Strict**
+**Strict**  
 Setting the `drill.strict` property  to “true” changes the schema mode to strict, which means that the reader ignores any columns NOT included in the schema. The query only returns the columns defined in the schema, as shown:
  
 	create or replace schema (id int, start_date date format 'yyyy-MM-dd') for table dfs.tmp.`text_table` properties ('drill.strict' = 'true');