Posted to commits@drill.apache.org by br...@apache.org on 2015/05/27 02:16:16 UTC

[4/4] drill git commit: DRILL-3169 multiple dir

DRILL-3169 multiple dir


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/446d71c2
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/446d71c2
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/446d71c2

Branch: refs/heads/gh-pages
Commit: 446d71c242edf6ed6e65924e1b4089677540f151
Parents: fac8fd4
Author: Kristine Hahn <kh...@maprtech.com>
Authored: Tue May 26 16:48:37 2015 -0700
Committer: Bridget Bevens <bb...@maprtech.com>
Committed: Tue May 26 17:14:30 2015 -0700

----------------------------------------------------------------------
 .../030-querying-plain-text-files.md            | 95 ++------------------
 .../040-querying-directories.md                 | 45 ++--------
 2 files changed, 12 insertions(+), 128 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/446d71c2/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
index aeb3543..f79f2b9 100644
--- a/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
+++ b/_docs/query-data/query-a-file-system/030-querying-plain-text-files.md
@@ -194,104 +194,23 @@ times a year in the books that Google scans.
          +------------------------------------+-------------------+------------+
          5 rows selected (1.175 seconds)
 
-The Drill default storage plugins support common file formats. If you need
-support for some other file format, such as GZ, create a custom storage plugin. You can also create a storage plugin to simplify querying files having long path names. A workspace name replaces the long path name.
+The Drill default storage plugins support common file formats. 
 
 
-## Create a Storage Plugin
+## Query the GZ File Directly
 
-This example covers how to create and use a storage plugin to simplify queries or to query a file type that `dfs` does not specify, GZ in this case. First, you create the storage plugin in the Drill Web UI. Next, you connect to the
-file through the plugin to query a file.
+This example shows how to query the GZ file containing the compressed TSV data. The GZ file must be renamed so that its name specifies the type of delimited file, such as CSV or TSV. In this example, you add `.tsv` before the `.gz` extension.
 
-You can create a storage plugin using the Apache Drill Web UI to query the GZ file containing the compressed TSV data.
-
-  1. Create an `ngram` directory on your file system.
-  2. Copy the GZ file `googlebooks-eng-all-5gram-20120701-zo.gz` to the `ngram` directory.
-  3. Open the Drill Web UI by navigating to <http://localhost:8047/storage>.   
-     To open the Drill Web UI, the [Drill shell]({{site.baseurl}}/docs/starting-drill-on-linux-and-mac-os-x/) must still be running.
-  4. In New Storage Plugin, type `myplugin`.  
-     ![new plugin]({{ site.baseurl }}/docs/img/ngram_plugin.png)    
-  5. Click **Create**.  
-     The Configuration screen appears.
-  6. Replace null with the following storage plugin definition, except on the location line, use the *full* path to your `ngram` directory instead of the drilluser's path and give your workspace an arbitrary name, for example, ngram:
-  
-        {
-          "type": "file",
-          "enabled": true,
-          "connection": "file:///",
-          "workspaces": {
-            "ngram": {
-              "location": "/Users/drilluser/ngram",
-              "writable": false,
-              "defaultInputFormat": null
-           }
-         },
-         "formats": {
-           "tsv": {
-             "type": "text",
-             "extensions": [
-               "gz"
-             ],
-             "delimiter": "\t"
-            }
-          }
-        }
-
-  7. Click **Create**.  
-     The success message appears briefly.
-  8. Click **Back**.  
-     The new plugin appears in Enabled Storage Plugins.  
-     ![new plugin]({{ site.baseurl }}/docs/img/ngram_plugin.png) 
-  9. Go back to the Drill shell, and list the storage plugins.  
-          SHOW DATABASES;
-
-          +---------------------+
-          |     SCHEMA_NAME     |
-          +---------------------+
-          | INFORMATION_SCHEMA  |
-          | cp.default          |
-          | dfs.default         |
-          | dfs.root            |
-          | dfs.tmp             |
-          | myplugin.default    |
-          | myplugin.ngram      |
-          | sys                 |
-          +---------------------+
-          8 rows selected (0.105 seconds)
-
-Your custom plugin appears in the list and has two workspaces: the `ngram`
-workspace that you defined and a default workspace.
-
-### Connect to and Query a File
-
-When querying the same data source repeatedly, avoiding long path names is
-important. This exercise demonstrates how to simplify the query. Instead of
-using the full path to the Ngram file, you use dot notation in the FROM
-clause.
-
-``<workspace name>.`<location>```
-
-This syntax assumes you connected to a storage plugin that defines the
-location of the data. To query the data source while you are _not_ connected to
-that storage plugin, include the plugin name:
-
-``<plugin name>.<workspace name>.`<location>```
-
-This exercise shows how to query Ngram data when you are connected to `myplugin`.
-
-  1. Connect to the ngram file through the custom storage plugin.  
-     `USE myplugin;`
-  2. Get data about "Zoological Journal of the Linnean" that appears more than 250 times a year in the books that Google scans. In the FROM clause, instead of using the full path to the file as you did in the last exercise, connect to the data using the storage plugin workspace name ngram.
+  1. Rename the GZ file `googlebooks-eng-all-5gram-20120701-zo.gz` to `googlebooks-eng-all-5gram-20120701-zo.tsv.gz`.
+  2. Query the renamed GZ file directly to get data about "Zoological Journal of the Linnean" that appears more than 250 times a year in the books that Google scans. In the FROM clause, reference the renamed file by its full path through the `dfs` storage plugin.
   
          SELECT COLUMNS[0], 
                 COLUMNS[1], 
                 COLUMNS[2] 
-         FROM ngram.`/googlebooks-eng-all-5gram-20120701-zo.gz` 
+         FROM dfs.`/Users/drilluser/Downloads/googlebooks-eng-all-5gram-20120701-zo.tsv.gz` 
          WHERE ((columns[0] = 'Zoological Journal of the Linnean') 
          AND (columns[2] > 250)) 
          LIMIT 10;
 
-     The five rows of output appear.  
-
-To continue with this example and query multiple files in a directory, see the section, ["Example of Querying Multiple Files in a Directory"]({{site.baseurl}}/docs/querying-directories/#example-of-querying-multiple-files-in-a-directory).
+     The 5 rows of output appear.  
 

http://git-wip-us.apache.org/repos/asf/drill/blob/446d71c2/_docs/query-data/query-a-file-system/040-querying-directories.md
----------------------------------------------------------------------
diff --git a/_docs/query-data/query-a-file-system/040-querying-directories.md b/_docs/query-data/query-a-file-system/040-querying-directories.md
index 4a5b4ae..88b5b40 100644
--- a/_docs/query-data/query-a-file-system/040-querying-directories.md
+++ b/_docs/query-data/query-a-file-system/040-querying-directories.md
@@ -13,8 +13,8 @@ same structure: `plays.csv` and `moreplays.csv`. The first file contains 7
 records and the second file contains 3 records. The following query returns
 the "union" of the two files, ordered by the first column:
 
-    0: jdbc:drill:zk=local> select columns[0] as `Year`, columns[1] as Play 
-    from dfs.`/Users/brumsby/drill/testdata` order by 1;
+    0: jdbc:drill:zk=local> SELECT COLUMNS[0] AS `Year`, COLUMNS[1] AS Play 
+    FROM dfs.`/Users/brumsby/drill/testdata` ORDER BY 1;
  
     +------------+------------------------+
     |    Year    |          Play          |
@@ -49,7 +49,7 @@ You can query all of these files, or a subset, by referencing the file system
 once in a Drill query. For example, the following query counts the number of
 records in all of the files inside the `2013` directory:
 
-    0: jdbc:drill:> select count(*) from MFS.`/mapr/drilldemo/labs/clicks/logs/2013` ;
+    0: jdbc:drill:> SELECT COUNT(*) FROM MFS.`/mapr/drilldemo/labs/clicks/logs/2013` ;
     +------------+
     |   EXPR$0   |
     +------------+
@@ -64,7 +64,7 @@ subdirectories: `2012`, `2013`, and `2014`. The following query constrains
 files inside the subdirectory named `2013`. The variable `dir0` refers to the
 first level down from logs, `dir1` to the next level, and so on.
 
-    0: jdbc:drill:> use bob.logdata;
+    0: jdbc:drill:> USE bob.logdata;
     +------------+-----------------------------------------+
     |     ok     |              summary                    |
     +------------+-----------------------------------------+
@@ -72,7 +72,7 @@ first level down from logs, `dir1` to the next level, and so on.
     +------------+-----------------------------------------+
     1 row selected (0.305 seconds)
  
-    0: jdbc:drill:> select * from logs where dir0='2013' limit 10;
+    0: jdbc:drill:> SELECT * FROM logs WHERE dir0='2013' LIMIT 10;
     +------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+
     |    dir0    |    dir1    |  trans_id  |    date    |    time    |  cust_id   |   device   |   state    |  camp_id   |  keywords   |
     +------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+
@@ -89,38 +89,3 @@ first level down from logs, `dir1` to the next level, and so on.
     +------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+
     10 rows selected (0.583 seconds)
 
-## Example of Querying Multiple Files in a Directory
-
-This example is a continuation of the example in the section, ["Example of Querying a TSV File"]({{site.baseurl}}/docs/querying-plain-text-files/#example-of-querying-a-tsv-file) that creates a subdirectory in the `ngram` directory and [custom plugin workspace]({{site.baseurl}}/docs/querying-plain-text-files/#create-a-storage-plugin) you created earlier.
-
-You download a second Ngram file. Next, you
-move both Ngram GZ files you downloaded to the `ngram` subdirectory. Finally, using the custom
-plugin workspace, you query both files. In the FROM clause, simply reference
-the subdirectory.
-
-  1. Download a second file of compressed Google Ngram data from this location: 
-  
-     http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-2gram-20120701-ze.gz
-  2. Move `googlebooks-eng-all-2gram-20120701-ze.gz` to the `ngram/myfiles` subdirectory. 
-  3. Move the 5gram file you downloaded earlier `googlebooks-eng-all-5gram-20120701-zo.gz` to the `ngram/myfiles` subdirectory.
-  4. In the Drill shell, use the `myplugin.ngrams` workspace. 
-   
-          USE myplugin.ngram;
-  5. Query the myfiles directory for the "Zoological Journal of the Linnean" or "zero temperatures" in books published in 1998.
-  
-          SELECT * 
-          FROM myfiles 
-          WHERE (((COLUMNS[0] = 'Zoological Journal of the Linnean')
-            OR (COLUMNS[0] = 'zero temperatures')) 
-            AND (COLUMNS[1] = '1998'));
-The output lists ngrams from both files.
-
-          +----------------------------------------------------------+
-          |                         columns                          |
-          +----------------------------------------------------------+
-          | ["Zoological Journal of the Linnean","1998","157","53"]  |
-          | ["zero temperatures","1998","628","487"]                 |
-          +----------------------------------------------------------+
-          2 rows selected (7.007 seconds)
-
-For more information about querying directories, see the section, ["Query Directory Functions"]({{site.baseurl}}/docs/query-directory-functions).
\ No newline at end of file