Posted to commits@drill.apache.org by kr...@apache.org on 2015/12/16 02:59:01 UTC

drill git commit: 1.4 update

Repository: drill
Updated Branches:
  refs/heads/gh-pages c61c47fe6 -> 7aa38e042


1.4 update

case/cast example per vicki

DRILL-3949


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/7aa38e04
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/7aa38e04
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/7aa38e04

Branch: refs/heads/gh-pages
Commit: 7aa38e042150e7ef294792dff3b7dc79f4aa6906
Parents: c61c47f
Author: Kris Hahn <kr...@apache.org>
Authored: Tue Dec 15 13:25:53 2015 -0800
Committer: Kris Hahn <kr...@apache.org>
Committed: Tue Dec 15 17:56:07 2015 -0800

----------------------------------------------------------------------
 .../010-configuration-options-introduction.md       | 10 +++++++---
 .../020-storage-plugin-registration.md              |  4 ++--
 .../060-text-files-csv-tsv-psv.md                   | 16 +++++++++++++++-
 3 files changed, 24 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/7aa38e04/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
index 6aa9017..8282740 100644
--- a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
+++ b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
@@ -16,8 +16,9 @@ The sys.options table lists the following options that you can set as a system o
 
 | Name                                           | Default          | Comments                                                                                                                                                                                                                                                                                                                                                         |
 |------------------------------------------------|------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| drill.exec.functions.cast_empty_string_to_null | FALSE            | Not supported in this release.                                                                                                                                                                                                                                                                                                                                   |
+| drill.exec.functions.cast_empty_string_to_null | FALSE            | In a text file, treat empty fields as NULL values instead of empty string.                                                                                                                                                                                                                                                                                       |
 | drill.exec.storage.file.partition.column.label | dir              | The column label for directory levels in results of queries of files in a directory. Accepts a string input.                                                                                                                                                                                                                                                     |
+| exec.enable_union_type                         | false            | Enable support for Avro union type.                                                                                                                                                                                                                                                                                                                              |
 | exec.errors.verbose                            | FALSE            | Toggles verbose output of executable error messages                                                                                                                                                                                                                                                                                                              |
 | exec.java_compiler                             | DEFAULT          | Switches between DEFAULT, JDK, and JANINO mode for the current session. Uses Janino by default for generated source code of less than exec.java_compiler_janino_maxsize; otherwise, switches to the JDK compiler.                                                                                                                                                |
 | exec.java_compiler_debug                       | TRUE             | Toggles the output of debug-level compiler error messages in runtime generated code.                                                                                                                                                                                                                                                                             |
@@ -59,9 +60,9 @@ The sys.options table lists the following options that you can set as a system o
 | planner.memory.enable_memory_estimation        | FALSE            | Toggles the state of memory estimation and re-planning of the query. When enabled, Drill conservatively estimates memory requirements and typically excludes these operators from the plan and negatively impacts performance.                                                                                                                                   |
 | planner.memory.hash_agg_table_factor           | 1.1              | A heuristic value for influencing the size of the hash aggregation table.                                                                                                                                                                                                                                                                                        |
 | planner.memory.hash_join_table_factor          | 1.1              | A heuristic value for influencing the size of the hash aggregation table.                                                                                                                                                                                                                                                                                        |
-| planner.memory_limit                           | 268435456 bytes  | Defines the maximum amount of direct memory allocated to a query for planning. When multiple queries run concurrently, each query is allocated the amount of memory set by this parameter.Increase the value of this parameter and rerun the query if partition pruning failed due to insufficient memory.                                                       |
 | planner.memory.max_query_memory_per_node       | 2147483648 bytes | Sets the maximum estimate of memory for a query per node in bytes. If the estimate is too low, Drill re-plans the query without memory-constrained operators.                                                                                                                                                                                                    |
 | planner.memory.non_blocking_operators_memory   | 64               | Extra query memory per node for non-blocking operators. This option is currently used only for memory estimation. Range: 0-2048 MB                                                                                                                                                                                                                               |
+| planner.memory_limit                           | 268435456 bytes  | Defines the maximum amount of direct memory allocated to a query for planning. When multiple queries run concurrently, each query is allocated the amount of memory set by this parameter. Increase the value of this parameter and rerun the query if partition pruning failed due to insufficient memory.                                                      |
 | planner.nestedloopjoin_factor                  | 100              | A heuristic value for influencing the nested loop join.                                                                                                                                                                                                                                                                                                          |
 | planner.partitioner_sender_max_threads         | 8                | Upper limit of threads for outbound queuing.                                                                                                                                                                                                                                                                                                                     |
 | planner.partitioner_sender_set_threads         | -1               | Overwrites the number of threads used to send out batches of records. Set to -1 to disable. Typically not changed.                                                                                                                                                                                                                                               |
@@ -70,16 +71,19 @@ The sys.options table lists the following options that you can set as a system o
 | planner.slice_target                           | 100000           | The number of records manipulated within a fragment before Drill parallelizes operations.                                                                                                                                                                                                                                                                        |
 | planner.width.max_per_node                     | 3                | Maximum number of threads that can run in parallel for a query on a node. A slice is an individual thread. This number indicates the maximum number of slices per query for the query’s major fragment on a node.                                                                                                                                                |
 | planner.width.max_per_query                    | 1000             | Same as max per node but applies to the query as executed by the entire cluster. For example, this value might be the number of active Drillbits, or a higher number to return results faster.                                                                                                                                                                   |
+| security.admin.user_groups                     | n/a              | Unsupported as of 1.4. A comma-separated list of administrator groups for Web Console security.                                                                                                                                                                                                                                                                  |
+| security.admin.users                           | <a name>         | Unsupported as of 1.4. A comma-separated list of user names to be granted administrator privileges.                                                                                                                                                                                                                                                              |
 | store.format                                   | parquet          | Output format for data written to tables with the CREATE TABLE AS (CTAS) command. Allowed values are parquet, json, psv, csv, or tsv.                                                                                                                                                                                                                            |
 | store.hive.optimize_scan_with_native_readers   | FALSE            | Optimize reads of Parquet-backed external tables from Hive by using Drill native readers instead of the Hive Serde interface. (Drill 1.2 and later)                                                                                                                                                                                                              |
 | store.json.all_text_mode                       | FALSE            | Drill reads all data from the JSON files as VARCHAR. Prevents schema change errors.                                                                                                                                                                                                                                                                              |
-| store.json.extended_types                      | FALSE            | Turns on special JSON structures that Drill serializes for storing more type information than the [four basic JSON types](http://docs.mongodb.org/manual/reference/mongodb-extended-json/).                                                                                                                                                                      |
+| store.json.extended_types                      | FALSE            | Turns on special JSON structures that Drill serializes for storing more type information than the four basic JSON types.                                                                                                                                                                                                                                         |
 | store.json.read_numbers_as_double              | FALSE            | Reads numbers with or without a decimal point as DOUBLE. Prevents schema change errors.                                                                                                                                                                                                                                                                          |
 | store.mongo.all_text_mode                      | FALSE            | Similar to store.json.all_text_mode for MongoDB.                                                                                                                                                                                                                                                                                                                 |
 | store.mongo.read_numbers_as_double             | FALSE            | Similar to store.json.read_numbers_as_double.                                                                                                                                                                                                                                                                                                                    |
 | store.parquet.block-size                       | 536870912        | Sets the size of a Parquet row group to the number of bytes less than or equal to the block size of MFS, HDFS, or the file system.                                                                                                                                                                                                                               |
 | store.parquet.compression                      | snappy           | Compression type for storing Parquet output. Allowed values: snappy, gzip, none                                                                                                                                                                                                                                                                                  |
 | store.parquet.enable_dictionary_encoding       | FALSE            | For internal use. Do not change.                                                                                                                                                                                                                                                                                                                                 |
+| store.parquet.dictionary.page-size             | 1048576          | The size in bytes of a dictionary page in Parquet output.                                                                                                                                                                                                                                                                                                        |
 | store.parquet.use_new_reader                   | FALSE            | Not supported in this release.                                                                                                                                                                                                                                                                                                                                   |
 | store.partition.hash_distribute                | FALSE            | Uses a hash algorithm to distribute data on partition keys in a CTAS partitioning operation. An alpha option--for experimental use at this stage. Do not use in production systems.                                                                                                                                                                              |
 | store.text.estimated_row_size_bytes            | 100              | Estimate of the row size in a delimited text file, such as csv. The closer to actual, the better the query plan. Used for all csv files in the system/session where the value is set. Impacts the decision to plan a broadcast join or not.                                                                                                                      |
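
For context, options in the sys.options table above are changed at the system or session level with `ALTER SYSTEM` or `ALTER SESSION`; a minimal sketch, using the planner.memory_limit row from the table above as the example option:

```sql
-- Raise the planning memory limit system-wide (value shown is the documented default)
ALTER SYSTEM SET `planner.memory_limit` = 268435456;

-- Inspect the current value and scope of the option
SELECT * FROM sys.options WHERE name = 'planner.memory_limit';
```

An `ALTER SESSION` form of the same statement limits the change to the current connection.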

http://git-wip-us.apache.org/repos/asf/drill/blob/7aa38e04/_docs/connect-a-data-source/020-storage-plugin-registration.md
----------------------------------------------------------------------
diff --git a/_docs/connect-a-data-source/020-storage-plugin-registration.md b/_docs/connect-a-data-source/020-storage-plugin-registration.md
index 9dec247..fd4b0ea 100644
--- a/_docs/connect-a-data-source/020-storage-plugin-registration.md
+++ b/_docs/connect-a-data-source/020-storage-plugin-registration.md
@@ -30,9 +30,9 @@ To register a new storage plugin configuration, enter a storage name, click **CR
 
 ## Storage Plugin Configuration Persistence
 
-Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode). For example, on Mac OS X, Drill uses `/tmp/drill/sys.storage_plugins` to store storage plugin configurations. The temporary directory clears when you quit the Drill shell. To save your storage plugin configurations from one session to the next, set the following option in the `drill-override.conf` file if you are running Drill in embedded mode.
+Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode). For example, on Mac OS X, Drill uses `/tmp/drill/sys.storage_plugins` to store storage plugin configurations. The temporary directory is cleared when you reboot. To preserve storage plugin configurations when you run Drill in embedded mode, copy them to a secure location.
 
-`drill.exec.sys.store.provider.local.path = "/mypath"`
+<!-- `drill.exec.sys.store.provider.local.path = "/mypath"` -->
 
 <!-- Enabling authorization to protect this data through the Web Console and REST API does not include protection for the data in the tmp directory or in ZooKeeper. 
 

http://git-wip-us.apache.org/repos/asf/drill/blob/7aa38e04/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
----------------------------------------------------------------------
diff --git a/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md b/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
index ccdfc54..0fdb165 100644
--- a/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
+++ b/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
@@ -27,7 +27,21 @@ If your text file have headers, you can enable extractHeader and select particul
 
 ### Cast data
 
-You can also improve performance by casting the VARCHAR data to INT, FLOAT, DATETIME, and so on when you read the data from a text file. Drill performs better reading fixed-width than reading VARCHAR data. 
+You can also improve performance by casting the VARCHAR data in a text file to INT, FLOAT, DATETIME, and so on when you read the data. Drill performs better reading fixed-width data than reading VARCHAR data.
+
+Text files that include empty strings might produce unacceptable results. Common ways to deal with empty strings are:
+
+* Set the drill.exec.functions.cast_empty_string_to_null SESSION/SYSTEM option to true.  
+* Use a CASE statement to cast empty strings to the values you want. For example, create a Parquet table named test from a CSV file named test.csv, casting empty strings to NULL in any column where they appear:  
+
+          CREATE TABLE test AS SELECT
+            CASE WHEN COLUMNS[0] = '' THEN CAST(NULL AS INTEGER) ELSE CAST(COLUMNS[0] AS INTEGER) END AS c1,
+            CASE WHEN COLUMNS[1] = '' THEN CAST(NULL AS VARCHAR(20)) ELSE CAST(COLUMNS[1] AS VARCHAR(20)) END AS c2,
+            CASE WHEN COLUMNS[2] = '' THEN CAST(NULL AS DOUBLE) ELSE CAST(COLUMNS[2] AS DOUBLE) END AS c3,
+            CASE WHEN COLUMNS[3] = '' THEN CAST(NULL AS DATE) ELSE CAST(COLUMNS[3] AS DATE) END AS c4,
+            CASE WHEN COLUMNS[4] = '' THEN CAST(NULL AS VARCHAR(20)) ELSE CAST(COLUMNS[4] AS VARCHAR(20)) END AS c5
+          FROM `test.csv`;
+
 
 ### Use a Distributed File System
 Using a distributed file system, such as HDFS, instead of a local file system to query the files also improves performance because currently Drill does not split files on block splits.
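
As an alternative to the CASE expression in the diff above, the session option named in the first bullet can be used so that a plain CAST no longer fails on empty strings; a brief sketch, reusing the test.csv example and two of its columns:

```sql
-- Treat empty fields in text files as NULL values for this session
ALTER SESSION SET `drill.exec.functions.cast_empty_string_to_null` = true;

-- Empty strings in COLUMNS[0] and COLUMNS[1] now become NULL on cast
CREATE TABLE test AS SELECT
  CAST(COLUMNS[0] AS INTEGER) AS c1,
  CAST(COLUMNS[1] AS VARCHAR(20)) AS c2
FROM `test.csv`;
```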