Posted to commits@drill.apache.org by kr...@apache.org on 2015/12/15 00:48:55 UTC

[04/11] drill git commit: migration tool docs

migration tool docs


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/965bfbf1
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/965bfbf1
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/965bfbf1

Branch: refs/heads/gh-pages
Commit: 965bfbf1ace6f5f05793902600a7568111579350
Parents: 161af8f
Author: Kris Hahn <kr...@apache.org>
Authored: Mon Dec 14 10:07:59 2015 -0800
Committer: Kris Hahn <kr...@apache.org>
Committed: Mon Dec 14 15:46:37 2015 -0800

----------------------------------------------------------------------
 .../performance-tuning/020-partition-pruning.md | 118 -------------------
 .../010-partition-pruning-introduction.md       |  21 ++++
 .../020-migrating-partitioned-data.md           |  50 ++++++++
 .../partition-pruning/030-partition-pruning.md  | 111 +++++++++++++++++
 4 files changed, 182 insertions(+), 118 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/020-partition-pruning.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/020-partition-pruning.md b/_docs/performance-tuning/020-partition-pruning.md
old mode 100644
new mode 100755
index 26f681f..ddf67eb
--- a/_docs/performance-tuning/020-partition-pruning.md
+++ b/_docs/performance-tuning/020-partition-pruning.md
@@ -2,121 +2,3 @@
 title: "Partition Pruning"
 parent: "Performance Tuning"
 --- 
-
-Partition pruning is a performance optimization that limits the number of files and partitions that Drill reads when querying file systems and Hive tables. When you partition data, Drill only reads a subset of the files that reside in a file system or a subset of the partitions in a Hive table when a query matches certain filter criteria.
- 
-The query planner in Drill performs partition pruning by evaluating the filters. If no partition filters are present, the underlying Scan operator reads all files in all directories and then sends the data to operators, such as Filter, downstream. When partition filters are present, the query planner pushes the filters down to the Scan if possible. The Scan reads only the directories that match the partition filters, thus reducing disk I/O.
-
-## Migrating Partitioned Data from Drill 1.1-1.2 to Drill 1.3
-Use the [drill-upgrade tool](https://github.com/parthchandra/drill-upgrade) to migrate Parquet data that you generated in Drill 1.1 or 1.2 before attempting to use the data with Drill 1.3 partition pruning.  This migration is mandatory because Parquet data generated by Drill 1.1 and 1.2 must be marked as Drill-generated, as described in [DRILL-4070](https://issues.apache.org/jira/browse/DRILL-4070). 
-
-Drill 1.3 fixes a bug to accurately process Parquet files produced by other tools, such as Pig and Hive. The bug fix eliminated the risk of inaccurate metadata that could cause incorrect results when querying Hive- and Pig-generated Parquet files. No such risk exists with Drill-generated Parquet files. Querying Drill-generated Parquet files, regardless of the Drill version, yields accurate results. Drill-generated Parquet files, regardless of the Drill release, contain accurate metadata.
-
-After using the drill-upgrade tool to migrate your partitioned, pre-1.3 Parquet data, Drill can distinguish these files from those generated by other tools, such as Hive and Pig. Use the migration tool only on files generated by Drill. 
-
-To partition and query Parquet files generated from other tools, use Drill to read and rewrite the files and metadata using the CTAS command with the PARTITION BY clause. Alternatively, use the tool that generated the original files to regenerate Parquet 1.8 or later files.
-
-## How to Partition Data
-
-In Drill 1.1.0 and later, if the data source is Parquet, no data organization tasks are required to take advantage of partition pruning. Write Parquet data using the [PARTITION BY]({{site.baseurl}}/docs/partition-by-clause/) clause in the CTAS statement. 
-
-The Parquet writer first sorts data by the partition keys, and then creates a new file when it encounters a new value for the partition columns. During partitioning, Drill creates separate files, but not separate directories, for different partitions. Each file contains exactly one partition value, but there can be multiple files for the same partition value. 
-
-Partition pruning uses the Parquet column statistics to determine which columns to use to prune. 
-
-Unlike using the Drill 1.0 partitioning, no view query is subsequently required, nor is it necessary to use the [dir* variables]({{site.baseurl}}/docs/querying-directories) after you use the Drill 1.1 PARTITION BY clause in a CTAS statement. 
-
-## Drill 1.0 Partitioning
-
-You perform the following steps to partition data in Drill 1.0.   
- 
-1. Devise a logical way to store the data in a hierarchy of directories. 
-2. Use CTAS to create Parquet files from the original data, specifying filter conditions.
-3. Move the files into directories in the hierarchy. 
-
-After partitioning the data, you need to create a view of the partitioned data to query the data. You can use the [dir* variables]({{site.baseurl}}/docs/querying-directories) in queries to refer to subdirectories in your workspace path.
- 
-### Drill 1.0 Partitioning Example
-
-Suppose you have text files containing several years of log data. To partition the data by year and quarter, create the following hierarchy of directories:  
-       
-       …/logs/1994/Q1  
-       …/logs/1994/Q2  
-       …/logs/1994/Q3  
-       …/logs/1994/Q4  
-       …/logs/1995/Q1  
-       …/logs/1995/Q2  
-       …/logs/1995/Q3  
-       …/logs/1995/Q4  
-       …/logs/1996/Q1  
-       …/logs/1996/Q2  
-       …/logs/1996/Q3  
-       …/logs/1996/Q4  
-
-Run the following CTAS statement, filtering on the Q1 1994 data.
- 
-          CREATE TABLE TT_1994_Q1 
-              AS SELECT * FROM <raw table data in text format >
-              WHERE columns[1] = 1994 AND columns[2] = 'Q1'
- 
-This creates a Parquet file with the log data for Q1 1994 in the current workspace.  You can then move the file into the correlating directory, and repeat the process until all of the files are stored in their respective directories.
-
-Now you can define views on the parquet files and query the views.  
-
-       0: jdbc:drill:zk=local> create view vv1 as select `dir0` as `year`, `dir1` as `qtr` from dfs.`/Users/max/data/multilevel/parquet`;
-       +------------+------------+
-       |     ok     |  summary   |
-       +------------+------------+
-       | true       | View 'vv1' created successfully in 'dfs.tmp' schema |
-       +------------+------------+
-       1 row selected (0.16 seconds)  
-
-Query the view to see all of the logs.  
-
-       0: jdbc:drill:zk=local> select * from dfs.tmp.vv1;
-       +------------+------------+
-       |    year    |    qtr     |
-       +------------+------------+
-       | 1994       | Q1         |
-       | 1994       | Q3         |
-       | 1994       | Q3         |
-       | 1994       | Q4         |
-       | 1994       | Q4         |
-       | 1994       | Q4         |
-       | 1994       | Q4         |
-       | 1995       | Q2         |
-       | 1995       | Q2         |
-       | 1995       | Q2         |
-       | 1995       | Q2         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1995       | Q4         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q1         |
-       | 1996       | Q2         |
-       | 1996       | Q3         |
-       | 1996       | Q3         |
-       | 1996       | Q3         |
-       +------------+------------+
-       ...
-
-
-When you query the view, Drill can apply partition pruning and read only the files and directories required to return query results.
-
-       0: jdbc:drill:zk=local> explain plan for select * from dfs.tmp.vv1 where `year` = 1996 and qtr = 'Q2';
-       +------------+------------+
-       |    text    |    json    |
-       +------------+------------+
-       | 00-00    Screen
-       00-01      Project(year=[$0], qtr=[$1])
-       00-02        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/maxdata/multilevel/parquet/1996/Q2/orders_96_q2.parquet]], selectionRoot=/Users/max/data/multilevel/parquet, numFiles=1, columns=[`dir0`, `dir1`]]])
-       
-
-

http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md b/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md
new file mode 100755
index 0000000..77c16d8
--- /dev/null
+++ b/_docs/performance-tuning/partition-pruning/010-partition-pruning-introduction.md
@@ -0,0 +1,21 @@
+---
+title: "Partition Pruning Introduction"
+parent: "Partition Pruning"
+--- 
+
+Partition pruning is a performance optimization that limits the number of files and partitions that Drill reads when querying file systems and Hive tables. When you partition data, Drill only reads a subset of the files that reside in a file system or a subset of the partitions in a Hive table when a query matches certain filter criteria.
+
+The query planner in Drill performs partition pruning by evaluating the filters. If no partition filters are present, the underlying Scan operator reads all files in all directories and then sends the data to operators, such as Filter, downstream. When partition filters are present, the query planner pushes the filters down to the Scan if possible. The Scan reads only the directories that match the partition filters, thus reducing disk I/O.
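
As a file-system analogy (hypothetical layout and file names, not Drill internals), pruning turns a scan of every partition directory into a scan of only the directories that match the filter:

```shell
# Hypothetical directory layout partitioned by year and quarter
mkdir -p logs/1994/Q1 logs/1994/Q2 logs/1995/Q1
touch logs/1994/Q1/a.parquet logs/1994/Q2/b.parquet logs/1995/Q1/c.parquet

# No partition filter: the scan must touch every directory
find logs -name '*.parquet'

# Filter on year=1994 AND qtr=Q1: the scan is pruned to one directory
find logs/1994/Q1 -name '*.parquet'
```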
+
+## Using Partitioned Drill 1.1-1.2 Data
+Before using partitioned Drill 1.1-1.2 data in Drill 1.3, you need to migrate the data. Migrate Parquet data as described in "Migrating Partitioned Data". 
+
+{% include startimportant.html %}Migrate only Parquet files that Drill generated.{% include endimportant.html %}
+
+## Partitioning Data
+Prior to the release of Drill 1.1, partition pruning involved time-consuming manual setup tasks. Using the PARTITION BY clause in the CTAS command simplifies the process. "How to Partition Data" describes this process.
+
+
+
+
+

http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md b/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md
new file mode 100755
index 0000000..d3ddcc8
--- /dev/null
+++ b/_docs/performance-tuning/partition-pruning/020-migrating-partitioned-data.md
@@ -0,0 +1,50 @@
+---
+title: "Migrating Partitioned Data"
+parent: "Partition Pruning"
+--- 
+
+Migrating Parquet data that you partitioned and generated in Drill 1.1 or 1.2 is mandatory before using the data in Drill 1.3: the data must be marked as Drill-generated. Use the [drill-upgrade tool](https://github.com/parthchandra/drill-upgrade) to perform the migration. 
+
+{% include startimportant.html %} Run the upgrade tool only on Drill-generated Parquet files. {% include endimportant.html %}
+
+<!-- as described in [DRILL-4070](https://issues.apache.org/jira/browse/DRILL-4070).  -->
+
+## Why Migrate Drill 1.1-1.2 Data
+Parquet data partitioning became available in Drill 1.1 with the introduction of the PARTITION BY clause of the CTAS command. Drill 1.3 uses the latest (as of the 1.3 release date) Apache Parquet library when generating and partitioning Parquet files, whereas Drill 1.1 and 1.2 used a previous version of the library that the Drill team had modified. The Drill team fixed a bug in that previous library so that Drill could accurately process Parquet files generated by other tools, such as Impala and Hive. Apache Parquet fixed the same bug in the latest library, making it suitable for use in Drill 1.3, so Drill now uses the same Apache Parquet library as Impala, Hive, and other software. You need to run the upgrade tool on Parquet files that Drill 1.1 and 1.2 generated with the previous library. 
+
+The upgrade tool simply inserts a version number in the metadata to mark the file as a Drill file. 
+
+<!-- The bug fix eliminated the risk of inaccurate metadata that could cause incorrect results when querying Hive- and Pig-generated Parquet files. No such risk exists with Drill-generated Parquet files. Querying Drill-generated Parquet files, regardless of the Drill version, yields accurate results. Drill-generated Parquet files, regardless of the Drill release, contain accurate metadata. -->
+
+## How to Migrate Data
+The [drill-upgrade tool](https://github.com/parthchandra/drill-upgrade) modifies one file at a time. The temp directory holds a copy of the file that is currently being modified, for recovery in the event of a system failure. 
+
+System administrators can write a shell script to run the upgrade tool simultaneously on multiple sub-directories.
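
One way to sketch such a script (all paths and the jar location are illustrative; the tool invocation follows the example command later on this page):

```shell
#!/bin/sh
# Sketch: run the upgrade tool once per top-level subdirectory, in parallel.
# Each run gets its own temp directory, since different directories can
# contain files with the same names.
UPGRADE=${UPGRADE:-"java -cp drill-upgrade-1.0-jar-with-dependencies.jar org.apache.drill.upgrade.Upgrade_12_13"}

upgrade_all() {
    data_root=$1
    temp_root=$2
    for dir in "$data_root"/*/; do
        name=$(basename "$dir")
        mkdir -p "$temp_root/$name"                  # one distinct temp directory per run
        $UPGRADE --tempDir="$temp_root/$name" "$dir" &
    done
    wait                                             # block until every run finishes
}
```

Overriding `UPGRADE` (for example, with `echo`) lets you dry-run the loop; on MapR-FS the directory arguments would be `maprfs:///` URIs as in the example command below.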
+
+## Preparing for the Migration
+In a test by Drill developers, upgrading 1 TB of data in 840 files took 32 minutes, and upgrading 100 GB of data in 200,000 files took 370 minutes. Although the size of the files is a factor in the upgrade time, the number of files is the most significant factor.
+
+To migrate Parquet data for use in Drill 1.3 that you partitioned and generated in Drill 1.1 or 1.2, follow these steps:
+
+{% include startimportant.html %} Run the upgrade tool only on Drill-generated Parquet files. {% include endimportant.html %}
+
+1. Back up the data to be migrated.  
+2. Create one or more temp directories, depending on how you plan to run the upgrade tool, on the same file system as the data.  
+   For example, if the data is on HDFS, create the temp directory on HDFS.
+   Create distinct temp directories when you run the upgrade tool simultaneously on multiple directories, because different directories can contain files with the same names.  
+3. Access the upgrade tool at TBD.  
+4. If you use [Parquet metadata caching]({{site.baseurl}}/docs/optimizing-parquet-metadata-reading/#how-to-trigger-generation-of-the-parquet-metadata-cache-file):  
+   * Delete the cache file you generated from all directories and subdirectories where you plan to run the upgrade tool.  
+   * Run REFRESH TABLE METADATA on all the folders where a cache file previously existed.  
+5. Run the upgrade tool as shown in the following example:  
+   `java -Dlog.path=/home/rchallapalli/work/drill-upgrade/upgrade.log -cp drill-upgrade-1.0-jar-with-dependencies.jar org.apache.drill.upgrade.Upgrade_12_13 --tempDir=maprfs:///drill/upgrade-temp maprfs:///drill/testdata/`
+
+## Checking the Success of the Migration
+
+## Handling of Migration Failure
+
+If a network connection goes down, or if a user cancels the operation, the file that was being processed at the time could be corrupted. Always copy that file back from the temp directory before rerunning the upgrade tool. When rerun, the tool skips the files that it has already processed and updates only the remaining files.
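
A minimal recovery sketch (directory and function names are hypothetical): restore the possibly corrupt file from its intact copy in the temp directory, then rerun the tool.

```shell
# After an interrupted run, the file being processed may be corrupt, but the
# temp directory still holds an intact copy of it. Restore it before rerunning.
restore_from_temp() {
    temp_dir=$1
    data_dir=$2
    for f in "$temp_dir"/*; do
        [ -e "$f" ] || continue                  # temp dir empty: nothing to restore
        cp "$f" "$data_dir/$(basename "$f")"     # overwrite the possibly corrupt copy
    done
}
# Then rerun the upgrade tool; it skips files it has already processed.
```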
+
+
+

http://git-wip-us.apache.org/repos/asf/drill/blob/965bfbf1/_docs/performance-tuning/partition-pruning/030-partition-pruning.md
----------------------------------------------------------------------
diff --git a/_docs/performance-tuning/partition-pruning/030-partition-pruning.md b/_docs/performance-tuning/partition-pruning/030-partition-pruning.md
new file mode 100755
index 0000000..e376d5d
--- /dev/null
+++ b/_docs/performance-tuning/partition-pruning/030-partition-pruning.md
@@ -0,0 +1,111 @@
+---
+title: "Partition Pruning"
+parent: "Partition Pruning"
+--- 
+
+In Drill 1.1.0 and later, if the data source is Parquet, no data organization tasks are required to take advantage of partition pruning. To partition and query Parquet files generated from other tools, use Drill to read and rewrite the files and metadata using the CTAS command with the PARTITION BY clause, as described in the following section "How to Partition Data".
+
+## How to Partition Data
+
+Write Parquet data using the [PARTITION BY]({{site.baseurl}}/docs/partition-by-clause/) clause in the CTAS statement. 
+
+The Parquet writer first sorts data by the partition keys, and then creates a new file when it encounters a new value for the partition columns. During partitioning, Drill creates separate files, but not separate directories, for different partitions. Each file contains exactly one partition value, but there can be multiple files for the same partition value. 
+
+Partition pruning uses the Parquet column statistics to determine which columns to use to prune. 
+
+Unlike using the Drill 1.0 partitioning, no view query is subsequently required, nor is it necessary to use the [dir* variables]({{site.baseurl}}/docs/querying-directories) after you use the PARTITION BY clause in a CTAS statement. 
+
+## Drill 1.0 Partitioning
+
+Drill 1.0 does not support the PARTITION BY clause of the CTAS command that later versions support. Partitioning data in Drill 1.0 involves the following steps.   
+ 
+1. Devise a logical way to store the data in a hierarchy of directories. 
+2. Use CTAS to create Parquet files from the original data, specifying filter conditions.
+3. Move the files into directories in the hierarchy. 
+
+After partitioning the data, you need to create a view of the partitioned data to query the data. You can use the [dir* variables]({{site.baseurl}}/docs/querying-directories) in queries to refer to subdirectories in your workspace path.
+ 
+### Drill 1.0 Partitioning Example
+
+Suppose you have text files containing several years of log data. To partition the data by year and quarter, create the following hierarchy of directories:  
+       
+       …/logs/1994/Q1  
+       …/logs/1994/Q2  
+       …/logs/1994/Q3  
+       …/logs/1994/Q4  
+       …/logs/1995/Q1  
+       …/logs/1995/Q2  
+       …/logs/1995/Q3  
+       …/logs/1995/Q4  
+       …/logs/1996/Q1  
+       …/logs/1996/Q2  
+       …/logs/1996/Q3  
+       …/logs/1996/Q4  
+
+Run the following CTAS statement, filtering on the Q1 1994 data.
+ 
+          CREATE TABLE TT_1994_Q1 
+              AS SELECT * FROM <raw table data in text format >
+              WHERE columns[1] = 1994 AND columns[2] = 'Q1'
+ 
+This creates a Parquet file with the log data for Q1 1994 in the current workspace. You can then move the file into the corresponding directory, and repeat the process until all of the files are stored in their respective directories.
+
+Now you can define views on the Parquet files and query the views.  
+
+       0: jdbc:drill:zk=local> create view vv1 as select `dir0` as `year`, `dir1` as `qtr` from dfs.`/Users/max/data/multilevel/parquet`;
+       +------------+------------+
+       |     ok     |  summary   |
+       +------------+------------+
+       | true       | View 'vv1' created successfully in 'dfs.tmp' schema |
+       +------------+------------+
+       1 row selected (0.16 seconds)  
+
+Query the view to see all of the logs.  
+
+       0: jdbc:drill:zk=local> select * from dfs.tmp.vv1;
+       +------------+------------+
+       |    year    |    qtr     |
+       +------------+------------+
+       | 1994       | Q1         |
+       | 1994       | Q3         |
+       | 1994       | Q3         |
+       | 1994       | Q4         |
+       | 1994       | Q4         |
+       | 1994       | Q4         |
+       | 1994       | Q4         |
+       | 1995       | Q2         |
+       | 1995       | Q2         |
+       | 1995       | Q2         |
+       | 1995       | Q2         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1995       | Q4         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q1         |
+       | 1996       | Q2         |
+       | 1996       | Q3         |
+       | 1996       | Q3         |
+       | 1996       | Q3         |
+       +------------+------------+
+       ...
+
+
+When you query the view, Drill can apply partition pruning and read only the files and directories required to return query results.
+
+       0: jdbc:drill:zk=local> explain plan for select * from dfs.tmp.vv1 where `year` = 1996 and qtr = 'Q2';
+       +------------+------------+
+       |    text    |    json    |
+       +------------+------------+
+       | 00-00    Screen
+       00-01      Project(year=[$0], qtr=[$1])
+       00-02        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/Users/max/data/multilevel/parquet/1996/Q2/orders_96_q2.parquet]], selectionRoot=/Users/max/data/multilevel/parquet, numFiles=1, columns=[`dir0`, `dir1`]]])
+       
+
+