Posted to commits@drill.apache.org by ts...@apache.org on 2015/05/30 07:03:28 UTC

[02/26] drill git commit: DRILL-3179

DRILL-3179


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/c6be6cc5
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/c6be6cc5
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/c6be6cc5

Branch: refs/heads/gh-pages
Commit: c6be6cc507faa0bf653c7d7ba8953e168476125b
Parents: 55f2549
Author: Kristine Hahn <kh...@maprtech.com>
Authored: Tue May 26 14:39:52 2015 -0700
Committer: Kristine Hahn <kh...@maprtech.com>
Committed: Tue May 26 14:39:52 2015 -0700

----------------------------------------------------------------------
 .../020-configuring-drill-memory.md             |   3 +
 .../060-configuring-a-shared-drillbit.md        |   3 +
 .../010-configuration-options-introduction.md   |   9 +-
 .../035-plugin-configuration-introduction.md    |   4 +-
 .../050-json-data-model.md                      | 106 ++++++++-----------
 .../030-querying-plain-text-files.md            |  22 ++--
 6 files changed, 64 insertions(+), 83 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/c6be6cc5/_docs/configure-drill/020-configuring-drill-memory.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/020-configuring-drill-memory.md b/_docs/configure-drill/020-configuring-drill-memory.md
index 5948bcc..30d5121 100644
--- a/_docs/configure-drill/020-configuring-drill-memory.md
+++ b/_docs/configure-drill/020-configuring-drill-memory.md
@@ -38,3 +38,6 @@ The `drill-env.sh` file contains the following options:
 * Xmx specifies the maximum memory allocation pool for a Java Virtual Machine (JVM). 
 * Xms specifies the initial memory allocation pool.
 
+If performance is an issue, replace the -ea flag with -Dbounds=false, as shown in the following example:
+
+    export DRILL_JAVA_OPTS="-Xms1G -Xmx$DRILL_MAX_HEAP -XX:MaxDirectMemorySize=$DRILL_MAX_DIRECT_MEMORY -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=1G -Dbounds=false"
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/c6be6cc5/_docs/configure-drill/060-configuring-a-shared-drillbit.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/060-configuring-a-shared-drillbit.md b/_docs/configure-drill/060-configuring-a-shared-drillbit.md
index 1070586..2d31cef 100644
--- a/_docs/configure-drill/060-configuring-a-shared-drillbit.md
+++ b/_docs/configure-drill/060-configuring-a-shared-drillbit.md
@@ -10,6 +10,9 @@ Set [options in sys.options]({{site.baseurl}}/docs/configuration-options-introdu
 
 * exec.queue.large  
 * exec.queue.small  
+* exec.queue.threshold
+
+The exec.queue.threshold option sets the cost threshold for determining whether a query is large or small, based on its complexity; more complex queries have a higher cost. The default, 30,000,000, represents the estimated number of rows that a query will process. To serialize incoming queries, set the small queue to 0 and the threshold to 0.
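+
+For example, the following commands enable query queues and apply these settings as system options:
+
+    ALTER SYSTEM SET `exec.queue.enable` = true;
+    ALTER SYSTEM SET `exec.queue.small` = 0;
+    ALTER SYSTEM SET `exec.queue.threshold` = 0;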
 
 For more information, see the section, ["Performance Tuning"](/docs/performance-tuning-introduction/).
 

http://git-wip-us.apache.org/repos/asf/drill/blob/c6be6cc5/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
index bdd19f3..524ff67 100644
--- a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
+++ b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
@@ -24,10 +24,10 @@ The sys.options table lists the following options that you can set as a system o
 | exec.java_compiler_janino_maxsize              | 262144           | See the exec.java_compiler option comment. Accepts inputs of type LONG.                                                                                                                                                                                                                                                                                          |
 | exec.max_hash_table_size                       | 1073741824       | Ending size for hash tables. Range: 0 - 1073741824.                                                                                                                                                                                                                                                                                                              |
 | exec.min_hash_table_size                       | 65536            | Starting size for hash tables. Increase according to available memory to improve performance. Increasing for very large aggregations or joins when you have large amounts of memory for Drill to use. Range: 0 - 1073741824.                                                                                                                                     |
-| exec.queue.enable                              | FALSE            | Changes the state of query queues to control the number of queries that run simultaneously.                                                                                                                                                                                                                                                                      |
+| exec.queue.enable                              | FALSE            | Changes the state of query queues. False allows unlimited concurrent queries.                                                                                                                                                                                                                                                                                    |
 | exec.queue.large                               | 10               | Sets the number of large queries that can run concurrently in the cluster. Range: 0-1000                                                                                                                                                                                                                                                                         |
 | exec.queue.small                               | 100              | Sets the number of small queries that can run concurrently in the cluster. Range: 0-1001                                                                                                                                                                                                                                                                         |
-| exec.queue.threshold                           | 30000000         | Sets the cost threshold, which depends on the complexity of the queries in queue, for determining whether query is large or small. Complex queries have higher thresholds. Range: 0-9223372036854775807                                                                                                                                                          |
+| exec.queue.threshold                           | 30000000         | Sets the cost threshold for determining whether a query is large or small, based on its complexity; more complex queries have a higher cost. The default represents an estimated 30,000,000 rows that a query will process. Range: 0-9223372036854775807                                                                                                         |
 | exec.queue.timeout_millis                      | 300000           | Indicates how long a query can wait in queue before the query fails. Range: 0-9223372036854775807                                                                                                                                                                                                                                                                |
 | exec.schedule.assignment.old                   | FALSE            | Used to prevent query failure when no work units are assigned to a minor fragment, particularly when the number of files is much larger than the number of leaf fragments.                                                                                                                                                                                       |
 | exec.storage.enable_new_text_reader            | TRUE             | Enables the text reader that complies with the RFC 4180 standard for text/csv files.                                                                                                                                                                                                                                                                             |
@@ -80,7 +80,4 @@ The sys.options table lists the following options that you can set as a system o
 | store.parquet.enable_dictionary_encoding       | FALSE            | For internal use. Do not change.                                                                                                                                                                                                                                                                                                                                 |
 | store.parquet.use_new_reader                   | FALSE            | Not supported in this release.                                                                                                                                                                                                                                                                                                                                   |
 | store.text.estimated_row_size_bytes            | 100              | Estimate of the row size in a delimited text file, such as csv. The closer to actual, the better the query plan. Used for all csv files in the system/session where the value is set. Impacts the decision to plan a broadcast join or not.                                                                                                                      |
-| window.enable                                  | FALSE            | Not supported in this release. Coming soon.                                                                                                                                                                                                                                                                                                                      |
-
-
-
+| window.enable                                  | FALSE            | Not supported in this release. Coming soon.                                                                                                                                                                                                                                                                                                                      |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/c6be6cc5/_docs/connect-a-data-source/035-plugin-configuration-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/connect-a-data-source/035-plugin-configuration-introduction.md b/_docs/connect-a-data-source/035-plugin-configuration-introduction.md
index 7c850ae..c6bcbf8 100644
--- a/_docs/connect-a-data-source/035-plugin-configuration-introduction.md
+++ b/_docs/connect-a-data-source/035-plugin-configuration-introduction.md
@@ -58,9 +58,9 @@ The following table describes the attributes you configure for storage plugins.
   </tr>
   <tr>
     <td>"workspaces". . . "location"</td>
-    <td>"location": "/"<br>"location": "/tmp"</td>
+    <td>"location": "/Users/johndoe/mydata"<br>"location": "/tmp"</td>
     <td>no</td>
-    <td>Path to a directory on the file system.</td>
+    <td>Full path to a directory on the file system.</td>
   </tr>
   <tr>
     <td>"workspaces". . . "writable"</td>

http://git-wip-us.apache.org/repos/asf/drill/blob/c6be6cc5/_docs/data-sources-and-file-formats/050-json-data-model.md
----------------------------------------------------------------------
diff --git a/_docs/data-sources-and-file-formats/050-json-data-model.md b/_docs/data-sources-and-file-formats/050-json-data-model.md
index 70ccf94..1b1660d 100644
--- a/_docs/data-sources-and-file-formats/050-json-data-model.md
+++ b/_docs/data-sources-and-file-formats/050-json-data-model.md
@@ -126,7 +126,9 @@ Using the following techniques, you can query complex, nested JSON:
 * Generate key/value pairs for loosely structured data
 
 ## Example: Flatten and Generate Key Values for Complex JSON
-This example uses the following data that represents unit sales of tickets to events that were sold over a period of for several days in different states:
+This example uses the following data, which represents unit sales of tickets to events sold over several days in December:
+
+### ticket_sales.json Contents
 
     {
       "type": "ticket",
@@ -151,56 +153,32 @@ This example uses the following data that represents unit sales of tickets to ev
     
 Take a look at the data in Drill:
 
-    SELECT * FROM dfs.`/Users/drilluser/ticket_sales.json`;
-    +------------+------------+------------+------------+------------+
-    |    type    |  channel   |   month    |    day     |   sales    |
-    +------------+------------+------------+------------+------------+
-    | ticket     | 123455     | 12         | ["15","25","28","31"] | {"NY":"532806","PA":"112889","TX":"898999","UT":"10875"} |
-    | ticket     | 123456     | 12         | ["10","15","19","31"] | {"NY":"972880","PA":"857475","CA":"87350","OR":"49999"} |
-    +------------+------------+------------+------------+------------+
-    2 rows selected (0.041 seconds)
-
-### Flatten Arrays
-The FLATTEN function breaks the following _day arrays from the JSON example file shown earlier into separate rows.
-
-    "_day": [ 15, 25, 28, 31 ] 
-    "_day": [ 10, 15, 19, 31 ]
-
-Flatten the sales column of the ticket data onto separate rows, one row for each day in the array, for a better view of the data. FLATTEN copies the sales data related in the JSON object on each row.  Using the all (*) wildcard as the argument to flatten is not supported and returns an error.
-
-    SELECT flatten(tkt._day) AS `day`, tkt.sales FROM dfs.`/Users/drilluser/ticket_sales.json` tkt;
-
-    +------------+------------+
-    |    day     |   sales    |
-    +------------+------------+
-    | 15         | {"NY":532806,"PA":112889,"TX":898999,"UT":10875} |
-    | 25         | {"NY":532806,"PA":112889,"TX":898999,"UT":10875} |
-    | 28         | {"NY":532806,"PA":112889,"TX":898999,"UT":10875} |
-    | 31         | {"NY":532806,"PA":112889,"TX":898999,"UT":10875} |
-    | 10         | {"NY":972880,"PA":857475,"CA":87350,"OR":49999} |
-    | 15         | {"NY":972880,"PA":857475,"CA":87350,"OR":49999} |
-    | 19         | {"NY":972880,"PA":857475,"CA":87350,"OR":49999} |
-    | 31         | {"NY":972880,"PA":857475,"CA":87350,"OR":49999} |
-    +------------+------------+
-    8 rows selected (0.072 seconds)
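+    SELECT * FROM dfs.`/Users/drilluser/ticket_sales.json`;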
+    +---------+---------+---------------------------------------------------------------+
+    |  type   |  venue  |                             sales                             |
+    +---------+---------+---------------------------------------------------------------+
+    | ticket  | 123455  | {"12-10":532806,"12-11":112889,"12-19":898999,"12-21":10875}  |
+    | ticket  | 123456  | {"12-10":87350,"12-19":49999,"12-21":857475,"12-15":972880}   |
+    +---------+---------+---------------------------------------------------------------+
+    2 rows selected (1.343 seconds)
+
 
 ### Generate Key/Value Pairs
-Use the KVGEN (Key Value Generator) function to generate key/value pairs from complex data. Generating key/value pairs is often helpful when working with data that contains arbitrary maps consisting of dynamic and unknown element names, such as the ticket sales data by state. For example purposes, take a look at how kvgen breaks the sales data into keys and values representing the states and number of tickets sold:
+Continuing with the data from the [previous example]({{site.baseurl}}/docs/json-data-model/#example:-flatten-and-generate-key-values-for-complex-json), use the KVGEN (Key Value Generator) function to generate key/value pairs from complex data. Generating key/value pairs is often helpful when working with data that contains arbitrary maps consisting of dynamic and unknown element names, such as the ticket sales data in this example. Take a look at how KVGEN breaks the sales data into keys and values representing the December dates and the number of tickets sold:
 
-    SELECT KVGEN(tkt.sales) AS state_sales FROM dfs.`/Users/drilluser/ticket_sales.json` tkt;
-    +-------------+
-    | state_sales |
-    +-------------+
-    | [{"key":"NY","value":532806},{"key":"PA","value":112889},{"key":"TX","value":898999},{"key":"UT","value":10875}] |
-    | [{"key":"NY","value":972880},{"key":"PA","value":857475},{"key":"CA","value":87350},{"key":"OR","value":49999}] |
-    +-------------+
-    2 rows selected (0.039 seconds)
+    SELECT KVGEN(tkt.sales) AS `key dates:tickets sold` FROM dfs.`/Users/drilluser/ticket_sales.json` tkt;
+    +---------------------------------------------------------------------------------------------------------------------------------------+
+    |                                                        key dates:tickets sold                                                         |
+    +---------------------------------------------------------------------------------------------------------------------------------------+
+    | [{"key":"12-10","value":"532806"},{"key":"12-11","value":"112889"},{"key":"12-19","value":"898999"},{"key":"12-21","value":"10875"}] |
+    | [{"key":"12-10","value":"87350"},{"key":"12-19","value":"49999"},{"key":"12-21","value":"857475"},{"key":"12-15","value":"972880"}] |
+    +---------------------------------------------------------------------------------------------------------------------------------------+
+    2 rows selected (0.106 seconds)
 
 KVGEN allows queries against maps where the keys themselves represent data rather than a schema, as shown in the next example.
 
 ### Flatten JSON Data
 
-FLATTEN breaks the list of key-value pairs into separate rows on which you can apply analytic functions. FLATTEN takes a JSON array, such as the output from kvgen(sales), as an argument. Using the all (*) wildcard as the argument is not supported and returns an error.
+FLATTEN breaks the list of key-value pairs into separate rows on which you can apply analytic functions. FLATTEN takes a JSON array, such as the output from kvgen(sales), as an argument. Using the all (*) wildcard as the argument is not supported and returns an error. The following example continues using data from the [previous example]({{site.baseurl}}/docs/json-data-model/#example:-flatten-and-generate-key-values-for-complex-json):
 
     SELECT FLATTEN(kvgen(sales)) Sales 
     FROM dfs.`/Users/drilluser/drill/ticket_sales.json`;
@@ -220,41 +198,41 @@ FLATTEN breaks the list of key-value pairs into separate rows on which you can a
     8 rows selected (0.171 seconds)
 
 ### Example: Aggregate Loosely Structured Data
-Use flatten and kvgen together to aggregate the data. Continuing with the previous example, make sure all text mode is set to false to sum numbers. Drill returns an error if you attempt to sum data in all text mode. 
+Use flatten and kvgen together to aggregate the data from the [previous example]({{site.baseurl}}/docs/json-data-model/#example:-flatten-and-generate-key-values-for-complex-json). Make sure all text mode is set to false to sum numbers. Drill returns an error if you attempt to sum data in all text mode. 
 
     ALTER SYSTEM SET `store.json.all_text_mode` = false;
     
 Sum the ticket sales by combining the `SUM`, `FLATTEN`, and `KVGEN` functions in a single query.
 
-    SELECT SUM(tkt.tot_sales.`value`) AS TotalSales FROM (SELECT flatten(kvgen(sales)) tot_sales FROM dfs.`/Users/drilluser/ticket_sales.json`) tkt;
+    SELECT SUM(tkt.tot_sales.`value`) AS TicketsSold FROM (SELECT flatten(kvgen(sales)) tot_sales FROM dfs.`/Users/drilluser/ticket_sales.json`) tkt;
 
-    +------------+
-    | TotalSales |
-    +------------+
-    | 3523273    |
-    +------------+
-    1 row selected (0.081 seconds)
+    +--------------+
+    | TicketsSold  |
+    +--------------+
+    | 3523273.0    |
+    +--------------+
+    1 row selected (0.244 seconds)
 
 ### Example: Aggregate and Sort Data
-Sum the ticket sales by state and group by state and sort in ascending order. 
+Sum the ticket sales, group the results by December date, and sort them in ascending order. 
 
-    SELECT `right`(tkt.tot_sales.key,2) State, 
+    SELECT `right`(tkt.tot_sales.key,2) `December Date`, 
     SUM(tkt.tot_sales.`value`) AS TotalSales 
-    FROM (SELECT flatten(kvgen(sales)) tot_sales 
+    FROM (SELECT FLATTEN(kvgen(sales)) tot_sales 
     FROM dfs.`/Users/drilluser/ticket_sales.json`) tkt 
     GROUP BY `right`(tkt.tot_sales.key,2) 
     ORDER BY TotalSales;
 
-    +---------------+--------------+
-    | December_Date |  TotalSales  |
-    +---------------+--------------+
-    | 11            | 112889       |
-    | 10            | 620156       |
-    | 21            | 868350       |
-    | 19            | 948998       |
-    | 15            | 972880       |
-    +---------------+--------------+
-    5 rows selected (0.203 seconds)
+    +----------------+-------------+
+    | December Date  | TotalSales  |
+    +----------------+-------------+
+    | 11             | 112889.0    |
+    | 10             | 620156.0    |
+    | 21             | 868350.0    |
+    | 19             | 948998.0    |
+    | 15             | 972880.0    |
+    +----------------+-------------+
+    5 rows selected (0.252 seconds)
 
 ### Example: Access a Map Field in an Array
 To access a map field in an array, use dot notation to drill down through the hierarchy of the JSON data to the field. Examples are based on the following [City Lots San Francisco in .json](https://github.com/zemirco/sf-city-lots-json), modified slightly as described in the empty array workaround in ["Limitations and Workarounds."]({{ site.baseurl }}/docs/json-data-model#empty-array)

http://git-wip-us.apache.org/repos/asf/drill/blob/c6be6cc5/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
index ab73c57..aeb3543 100644
--- a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
+++ b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
@@ -157,7 +157,7 @@ path and name of the file in back ticks.
 
   3. Change the file name to add a `.tsv` extension.  
 The Drill `dfs` storage plugin definition includes a TSV format that requires
-a file to have this extension.
+a file to have this extension. Later in this section, you learn how to skip this step and query the GZ file directly.
 
 ### Query the Data
 
@@ -174,12 +174,12 @@ times a year in the books that Google scans.
      * In the WHERE clause, enclose the string literal "Zoological Journal of the Linnean" in single quotation marks.  
      * Limit the output to 10 rows.  
   
-         SELECT COLUMNS[0] AS Ngram,
-                COLUMNS[1] AS Publication_Date,
-                COLUMNS[2] AS Frequency
-         FROM `/Users/drilluser/Downloads/googlebooks-eng-all-5gram-20120701-zo.tsv`
-         WHERE ((columns[0] = 'Zoological Journal of the Linnean')
-             AND (columns[2] > 250)) LIMIT 10;
+            SELECT COLUMNS[0] AS Ngram,
+                   COLUMNS[1] AS Publication_Date,
+                   COLUMNS[2] AS Frequency
+            FROM `/Users/drilluser/Downloads/googlebooks-eng-all-5gram-20120701-zo.tsv`
+            WHERE ((columns[0] = 'Zoological Journal of the Linnean')
+            AND (columns[2] > 250)) LIMIT 10;
 
      The output is:
 
@@ -195,7 +195,7 @@ times a year in the books that Google scans.
          5 rows selected (1.175 seconds)
 
 The Drill default storage plugins support common file formats. If you need
-support for some other file format, such as GZ, create a custom storage plugin. You can also create a storage plugin to simplify querying file having long path names. A workspace name replaces the long path name.
+support for some other file format, such as GZ, create a custom storage plugin. You can also create a storage plugin to simplify querying files having long path names. A workspace name replaces the long path name.
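+
+For example, with a workspace named `ngram` like the one created later in this section, a query can reference the short workspace path instead of the full path name:
+
+    SELECT * FROM ngram.`/googlebooks-eng-all-5gram-20120701-zo.gz` LIMIT 10;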
 
 
 ## Create a Storage Plugin
@@ -203,7 +203,7 @@ support for some other file format, such as GZ, create a custom storage plugin.
 This example covers how to create and use a storage plugin to simplify queries or to query a file type that `dfs` does not specify, GZ in this case. First, you create the storage plugin in the Drill Web UI. Next, you connect to the
 file through the plugin to query a file.
 
-You can create a storage plugin using the Apache Drill Web UI to query the GZ file containing the compressed TSV data directly.
+You can create a storage plugin using the Apache Drill Web UI to query the GZ file containing the compressed TSV data.
 
   1. Create an `ngram` directory on your file system.
   2. Copy the GZ file `googlebooks-eng-all-5gram-20120701-zo.gz` to the `ngram` directory.
@@ -213,7 +213,7 @@ You can create a storage plugin using the Apache Drill Web UI to query the GZ fi
      ![new plugin]({{ site.baseurl }}/docs/img/ngram_plugin.png)    
   5. Click **Create**.  
      The Configuration screen appears.
-  6. Replace null with the following storage plugin definition, except on the location line, use the path to your `ngram` directory instead of the drilluser's path and give your workspace an arbitrary name, for example, ngram:
+  6. Replace null with the following storage plugin definition. On the location line, use the *full* path to your `ngram` directory instead of the drilluser's path, and give your workspace an arbitrary name, for example, ngram:
   
         {
           "type": "file",
@@ -288,7 +288,7 @@ This exercise shows how to query Ngram data when you are connected to `myplugin`
                 COLUMNS[2] 
          FROM ngram.`/googlebooks-eng-all-5gram-20120701-zo.gz` 
          WHERE ((columns[0] = 'Zoological Journal of the Linnean') 
-          AND (columns[2] > 250)) 
+         AND (columns[2] > 250)) 
          LIMIT 10;
 
      The five rows of output appear.