You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@drill.apache.org by br...@apache.org on 2018/11/07 21:51:56 UTC

[drill-site] branch asf-site updated: DRILL-6680 Add content to querying info schema page for new FILES table in Drill 1.15

This is an automated email from the ASF dual-hosted git repository.

bridgetb pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/drill-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 1237877  DRILL-6680 Add content to querying info schema page for new FILES table in Drill 1.15
1237877 is described below

commit 123787722c17b124b59208873525d7f18fc49a26
Author: Bridget Bevens <bb...@maprtech.com>
AuthorDate: Wed Nov 7 13:51:41 2018 -0800

    DRILL-6680 Add content to querying info schema page for new FILES table in Drill 1.15
---
 docs/querying-the-information-schema/index.html | 161 +++++++++++++++++++++++-
 feed.xml                                        |   4 +-
 2 files changed, 157 insertions(+), 8 deletions(-)

diff --git a/docs/querying-the-information-schema/index.html b/docs/querying-the-information-schema/index.html
index 9595ce3..7ad193a 100644
--- a/docs/querying-the-information-schema/index.html
+++ b/docs/querying-the-information-schema/index.html
@@ -1272,7 +1272,7 @@
 
     </div>
 
-     Nov 2, 2018
+     Nov 7, 2018
 
     <link href="/css/docpage.css" rel="stylesheet" type="text/css">
 
@@ -1291,6 +1291,7 @@ context. You can query the following INFORMATION_SCHEMA tables:</p>
 <li>TABLES</li>
 <li>COLUMNS </li>
 <li>VIEWS</li>
+<li>FILES</li>
 </ul>
 
 <h2 id="schemata">SCHEMATA</h2>
@@ -1325,15 +1326,16 @@ can also return a list of schemas by running the SHOW DATABASES command:</p>
 </code></pre></div>
 <h2 id="catalogs">CATALOGS</h2>
 
-<p>The CATALOGS table returns only one row, with the hardcoded DRILL catalog name
+<p>The CATALOGS table returns only one row, with the hard-coded DRILL catalog name
 and description.</p>
 
 <h2 id="tables">TABLES</h2>
 
 <p>The TABLES table returns the table name and type for each table or view in
-your databases. (Type means TABLE or VIEW.) Note that Drill does not return
-files available for querying in file-based data sources. Instead, use SHOW
-FILES to explore these data sources.</p>
+your databases. (Type means TABLE or VIEW.) Starting in Drill 1.15, Drill returns
+files available for querying in file-based data sources. You no longer have to use the SHOW
+FILES command to explore these data sources. You can query the FILES table for directory and 
+file information. </p>
 
 <h2 id="columns">COLUMNS</h2>
 
@@ -1345,10 +1347,157 @@ type) for each column in each table or view.</p>
 <p>The VIEWS table returns the name and definition for each view in your
 databases. Note that file schemas are the canonical repository for views in
 Drill. Depending on how you create a view, the may only be displayed in Drill
-after it has been used.</p>
+after it has been used.  </p>
 
+<h2 id="files">FILES</h2>
+
+<p>Starting in Drill 1.15, the INFORMATION_SCHEMA contains a FILES table that you can query for information about directories and files stored in the <a href="/docs/workspaces/">workspaces</a> configured within your <a href="/docs/s3-storage-plugin/#configuring-the-s3-storage-plugin">S3</a> and <a href="/docs/file-system-storage-plugin/">file system</a> storage plugin configurations. </p>
+
+<p>The FILES table is useful for analyzing folders and files before you run queries against data sources configured in Drill. When you query the FILES table, the FILES table lists the directories and files based on the permissions set for the current or impersonated user.  </p>
+
+<p>The FILES table stores the following information about directories and files in workspaces:  </p>
+
+<ul>
+<li><strong>SCHEMA_NAME</strong><br>
+The file system storage plugin name with the schema name. For example, dfs.tmp.<br></li>
+<li><strong>ROOT_SCHEMA_NAME</strong><br>
+The file system storage plugin name. For example, dfs.<br></li>
+<li><strong>WORKSPACE_NAME</strong><br>
+The workspace name. For example, tmp.<br></li>
+<li><strong>FILE_NAME</strong><br>
+The name of the directories and files. For example, sample.txt. Drill lists directories and files based on the permissions set for the current or impersonated user.<br></li>
+<li><strong>RELATIVE_PATH</strong><br>
+The relative path to a file. For example, <code>sample_folder/sample.txt</code>; assuming that the full file path is <code>/tmp/sample_folder/sample.txt</code>, and the workspace path is <code>/tmp</code>.<br></li>
+<li><strong>IS_DIRECTORY</strong><br>
+Lists true if the object is a directory. Lists false if the object is a file.<br></li>
+<li><strong>IS_FILE</strong><br>
+Lists true if the object is a file. Lists false if the object is a directory.<br></li>
+<li><strong>LENGTH</strong><br>
+Size of the directory or file in bytes. For example, 1210.<br></li>
+<li><strong>OWNER</strong><br>
+File or directory owner. For example, root.<br></li>
+<li><strong>GROUP</strong><br>
+Group to which the file or directory belongs. For example, root.<br></li>
+<li><strong>PERMISSION</strong><br>
+Permission that the current or impersonated user has on the file. For example, rw-rw-rw.<br></li>
+<li><strong>ACCESS_TIME</strong><br>
+Timestamp denoting the last time the file or directory was accessed.<br></li>
+<li><strong>MODIFICATION_TIME</strong><br>
+Timestamp denoting the last time the file or directory was changed.<br></li>
+</ul>
+
+<h3 id="listing-files-recursively">Listing Files Recursively</h3>
+
+<p>The FILES table can list files recursively; however, listing files recursively can negatively impact performance. When you enable the <code>storage.list_files_recursively</code> option, the FILES table lists all the directories and files nested under the current workspace directory. The <code>storage.list_files_recursively</code> option is disabled (set to false) by default. Issue the SET command to enable recursive listing, as shown:  </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">SET `storage.list_files_recursively` = true;      
+</code></pre></div>
 <h2 id="useful-queries">Useful Queries</h2>
 
+<p>The following sections demonstrate how to query the FILES table and TABLES table in the INFORMATION_SCHEMA:  </p>
+
+<h3 id="files-queries">FILES Queries</h3>
+
+<p>This example demonstrates how to use the FILES table to explore workspaces and identify duplicate files across the workspaces that are configured in different S3 storage plugins. </p>
+
+<p>For this example, S3 buckets were configured as data sources in Drill. Storage plugins were configured to connect Drill to each of the S3 buckets. The storage plugin named <code>s3_home_bucket</code> contains personal files, and the storage plugin named <code>s3_work_bucket</code> contains work files. Naming the storage plugins with the s3 prefix simplifies the listing of available schemas in the SCHEMATA table, as shown:  </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">0: jdbc:drill:zk=local&gt; select * from information_schema.schemata where schema_name like &#39;s3%&#39;;
++---------------+--------------------------+---------------+-------+-------------+
+| CATALOG_NAME  |       SCHEMA_NAME        | SCHEMA_OWNER  | TYPE  | IS_MUTABLE  |
++---------------+--------------------------+---------------+-------+-------------+
+| DRILL         | s3_home_bucket.default   | &lt;owner&gt;       | file  | NO          |
+| DRILL         | s3_home_bucket.root      | &lt;owner&gt;       | file  | NO          |
+| DRILL         | s3_work_bucket.default   | &lt;owner&gt;       | file  | NO          |
+| DRILL         | s3_work_bucket.root      | &lt;owner&gt;       | file  | NO          |
+| DRILL         | s3_years_bucket.default  | &lt;owner&gt;       | file  | NO          |
+| DRILL         | s3_years_bucket.root     | &lt;owner&gt;       | file  | NO          |
++---------------+--------------------------+---------------+-------+-------------+  
+</code></pre></div>
+<p>Querying the FILES table and filtering on the SCHEMA_NAME provides information about the files that exist within a workspace:  </p>
+
+<p><strong>Note:</strong> The word “files” is a reserved word in Drill and requires backticks (``).   </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">0: jdbc:drill:zk=local&gt; select * from information_schema.`files` where schema_name = &#39;s3_home_bucket.root&#39;;
++----------------------+-------------------+-----------------+-----------------------------+-----------------------------+---------------+----------+---------+--------+--------+-------------+------------------------+------------------------+
+|     SCHEMA_NAME      | ROOT_SCHEMA_NAME  | WORKSPACE_NAME  |          FILE_NAME          |        RELATIVE_PATH        | IS_DIRECTORY  | IS_FILE  | LENGTH  | OWNER  | GROUP  | PERMISSION  |      ACCESS_TIME       |   MODIFICATION_TIME    |
++----------------------+-------------------+-----------------+-----------------------------+-----------------------------+---------------+----------+---------+--------+--------+-------------+------------------------+------------------------+
+| s3_home_bucket.root  | s3_home_bucket    | root            | date_dim.txt                | date_dim.txt                | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:38:16.0  |
+| s3_home_bucket.root  | s3_home_bucket    | root            | household_demographics.txt  | household_demographics.txt  | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:38:28.0  |
+| s3_home_bucket.root  | s3_home_bucket    | root            | promotion.txt               | promotion.txt               | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:38:35.0  |
+| s3_home_bucket.root  | s3_home_bucket    | root            | time_dim.txt                | time_dim.txt                | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:38:43.0  |
++----------------------+-------------------+-----------------+-----------------------------+-----------------------------+---------------+----------+---------+--------+--------+-------------+------------------------+------------------------+  
+
+0: jdbc:drill:zk=local&gt; select * from information_schema.`files` where schema_name = &#39;s3_work_bucket.root&#39;;
++----------------------+-------------------+-----------------+-----------------------------+-----------------------------+---------------+----------+---------+--------+--------+-------------+------------------------+------------------------+
+|     SCHEMA_NAME      | ROOT_SCHEMA_NAME  | WORKSPACE_NAME  |          FILE_NAME          |        RELATIVE_PATH        | IS_DIRECTORY  | IS_FILE  | LENGTH  | OWNER  | GROUP  | PERMISSION  |      ACCESS_TIME       |   MODIFICATION_TIME    |
++----------------------+-------------------+-----------------+-----------------------------+-----------------------------+---------------+----------+---------+--------+--------+-------------+------------------------+------------------------+
+| s3_work_bucket.root  | s3_work_bucket    | root            | customer.txt                | customer.txt                | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:55:36.0  |
+| s3_work_bucket.root  | s3_work_bucket    | root            | household_demographics.txt  | household_demographics.txt  | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:55:59.0  |
+| s3_work_bucket.root  | s3_work_bucket    | root            | item.txt                    | item.txt                    | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:56:13.0  |
+| s3_work_bucket.root  | s3_work_bucket    | root            | promotion.txt               | promotion.txt               | false         | true     | 0       | root   | root   | rw-r--r--   | 1969-12-31 16:00:00.0  | 2018-11-06 16:56:29.0  |
++----------------------+-------------------+-----------------+-----------------------------+-----------------------------+---------------+----------+---------+--------+--------+-------------+------------------------+------------------------+
+</code></pre></div>
+<p>Notice that the FILE_NAME column lists the files stored in the workspaces. You can see that duplicate files exist in the work and home buckets. Alternatively, you can see the duplicate files by querying the FILE_NAME column directly and filtering on SCHEMA_NAME and IS_FILE, as shown:  </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">0: jdbc:drill:zk=local&gt; select file_name from information_schema.`files` where schema_name = &#39;s3_home_bucket.root&#39; and is_file is true;
++-----------------------------+
+|          file_name          |
++-----------------------------+
+| date_dim.txt                |
+| household_demographics.txt  |
+| promotion.txt               |
+| time_dim.txt                |
++-----------------------------+
+
+0: jdbc:drill:zk=local&gt; Select file_name from information_schema.`files` where schema_name = &#39;s3_work_bucket.root&#39; and is_file is true;
++-----------------------------+
+|          file_name          |
++-----------------------------+
+| customer.txt                |
+| household_demographics.txt  |
+| item.txt                    |
+| promotion.txt               |
++-----------------------------+  
+</code></pre></div>
+<p>Issuing a slightly more complex query on the FILES table reveals the duplicate files across the two schemas:  </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">0: jdbc:drill:zk=local&gt; select file_name from information_schema.`files` where schema_name in (&#39;s3_work_bucket.root&#39;, &#39;s3_home_bucket.root&#39;) and is_file is true group by file_name having count(file_name) &gt; 1;
++-----------------------------+
+|          file_name          |
++-----------------------------+
+| household_demographics.txt  |
+| promotion.txt               |
++-----------------------------+  
+</code></pre></div>
+<p>By default, the FILES table does not list the files recursively. Another schema named <code>s3_years_bucket.root</code> contains three folders with files in it, as shown:  </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">0: jdbc:drill:zk=local&gt; Select file_name, is_directory from information_schema.`files` where schema_name = &#39;s3_years_bucket.root&#39;;
++------------+---------------+
+| file_name  | is_directory  |
++------------+---------------+
+| 2016       | true          |
+| 2017       | true          |
+| 2018       | true          |
++------------+---------------+  
+</code></pre></div>
+<p>Though the folders contain files, the FILES table does not list the files nested inside the folders unless we enable the <code>storage.list_files_recursively</code> option, as shown:  </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">0: jdbc:drill:zk=local&gt; SET `storage.list_files_recursively` = true;
++-------+------------------------------------------+
+|  ok   |                 summary                  |
++-------+------------------------------------------+
+| true  | storage.list_files_recursively updated.  |
++-------+------------------------------------------+  
+</code></pre></div>
+<p>With recursive listing enabled, you can see that the same query run against the schema reveals the nested files in the folders:  </p>
+<div class="highlight"><pre><code class="language-text" data-lang="text">0: jdbc:drill:zk=local&gt; select file_name, relative_path, is_directory, is_file from information_schema.`files` where schema_name = &#39;s3_years_bucket.root&#39;;
++--------------------------+-------------------------------+---------------+----------+
+|        file_name         |         relative_path         | is_directory  | is_file  |
++--------------------------+-------------------------------+---------------+----------+
+| 2016                     | 2016                          | true          | false    |
+| profile_2016_01_01.json  | 2016/profile_2016_01_01.json  | false         | true     |
+| 2017                     | 2017                          | true          | false    |
+| profile_2017_01_01.json  | 2017/profile_2017_01_01.json  | false         | true     |
+| 2018                     | 2018                          | true          | false    |
+| profile_2018_01_01.json  | 2018/profile_2018_01_01.json  | false         | true     |
++--------------------------+-------------------------------+---------------+----------+
+</code></pre></div>
+<h3 id="tables-queries">TABLES Queries</h3>
+
 <p>Run an <code>INFORMATION_SCHEMA.`TABLES`</code>query to view all of the tables and views
 within a database. TABLES is a reserved word in Drill and requires back ticks
 (`).</p>
diff --git a/feed.xml b/feed.xml
index f91fea7..bcbf137 100644
--- a/feed.xml
+++ b/feed.xml
@@ -6,8 +6,8 @@
 </description>
     <link>/</link>
     <atom:link href="/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Tue, 06 Nov 2018 15:56:58 -0800</pubDate>
-    <lastBuildDate>Tue, 06 Nov 2018 15:56:58 -0800</lastBuildDate>
+    <pubDate>Wed, 07 Nov 2018 13:48:56 -0800</pubDate>
+    <lastBuildDate>Wed, 07 Nov 2018 13:48:56 -0800</lastBuildDate>
     <generator>Jekyll v2.5.2</generator>
     
       <item>