You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@drill.apache.org by dz...@apache.org on 2023/02/21 09:09:55 UTC

[drill-site] branch master updated (d1dede5c1 -> e9c3444f8)

This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/drill-site.git


    from d1dede5c1 Document INTERSECT and EXCEPT.
     new 98eb71761 Document the Google Sheets storage plugin.
     new 813d2cd64 Document DRILL-8290: Early exit from recursive file listing for LIMIT 0 queries.
     new e9c3444f8 Document DRILL-8380: Remove customised SqlValidatorImpl.deriveAlias.

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../plugins/170-google-sheets-plugin.md            | 200 +++++++++++++++++++++
 .../sql-commands/031-create-temporary-table-as.md  |  20 ++-
 .../sql-reference/sql-commands/084-limit-clause.md |   6 +
 3 files changed, 217 insertions(+), 9 deletions(-)
 create mode 100644 _docs/en/connect-a-data-source/plugins/170-google-sheets-plugin.md


[drill-site] 02/03: Document DRILL-8290: Early exit from recursive file listing for LIMIT 0 queries.

Posted by dz...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/drill-site.git

commit 813d2cd64d67b2f8e11ba3ba2520e7313c6980bc
Author: James Turton <ja...@somecomputer.xyz>
AuthorDate: Tue Feb 21 10:56:18 2023 +0200

    Document DRILL-8290: Early exit from recursive file listing for LIMIT 0 queries.
---
 _docs/en/sql-reference/sql-commands/084-limit-clause.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/_docs/en/sql-reference/sql-commands/084-limit-clause.md b/_docs/en/sql-reference/sql-commands/084-limit-clause.md
index df9f99544..160128628 100644
--- a/_docs/en/sql-reference/sql-commands/084-limit-clause.md
+++ b/_docs/en/sql-reference/sql-commands/084-limit-clause.md
@@ -86,6 +86,12 @@ LIMIT 0 optimizations do not work for queries with the UNION [ALL] set operator
 - CONVERT_TOEXTENDEDJSON
 - AVG (window function)
 
+**Recursive file listing in Drill 1.21+**
+Since version 1.21, Drill will exit early from recursive file listing during the planning of a query against filesystem storage if has detected a LIMIT 0 in the outermost SELECT. This optimization is aimed at queries of the following form.
+```
+SELECT * FROM dfs.`huge_directory` LIMIT 0;
+```
+
 ## Examples
 The following example query includes the ORDER BY and LIMIT clauses and returns the top 20 sales totals by month and state:
 


[drill-site] 03/03: Document DRILL-8380: Remove customised SqlValidatorImpl.deriveAlias.

Posted by dz...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/drill-site.git

commit e9c3444f85e3f12c53ca68db84faf869f8f8694b
Author: James Turton <ja...@somecomputer.xyz>
AuthorDate: Tue Feb 21 11:09:13 2023 +0200

    Document DRILL-8380: Remove customised SqlValidatorImpl.deriveAlias.
---
 .../sql-commands/031-create-temporary-table-as.md    | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/_docs/en/sql-reference/sql-commands/031-create-temporary-table-as.md b/_docs/en/sql-reference/sql-commands/031-create-temporary-table-as.md
index 43c2b4c19..de7275f63 100644
--- a/_docs/en/sql-reference/sql-commands/031-create-temporary-table-as.md
+++ b/_docs/en/sql-reference/sql-commands/031-create-temporary-table-as.md
@@ -3,7 +3,9 @@ title: "CREATE TEMPORARY TABLE AS (CTTAS)"
 slug: "CREATE TEMPORARY TABLE AS (CTTAS)"
 parent: "SQL Commands"
 ---
-As of Drill 1.10, you can use the CREATE TEMPORARY TABLE AS (CTTAS) command to store the results of a query in a temporary table. You can reference the temporary table in subsequent queries within the same session, thereby improving query performance. Data written to the temporary table is not permanently stored on the filesystem. Drill automatically drops the temporary table once the session ends or the Drillbit process fails. Therefore, you do not have to manually drop the table.
+
+**Introduced in release: 1.10**
+You can use the CREATE TEMPORARY TABLE AS (CTTAS) command to store the results of a query in a temporary table. You can reference the temporary table in subsequent queries within the same session, thereby improving query performance. Data written to the temporary table is not permanently stored on the filesystem. Drill automatically drops the temporary table once the session ends or the Drillbit process fails. Therefore, you do not have to manually drop the table.
 
 {% include startnote.html %}You cannot create a view over a temporary table.{% include endnote.html %}
 
@@ -49,6 +51,7 @@ To override the default temporary workspace, define `drill.exec.default_temporar
 To change the default temporary workspace connection or directory path, update the dfs storage plugin on the Storage page in the Drill Web UI. For example, dfs connection attribute can be `file:///` or `maprfs:///` and the `dfs.tmp` location attribute can be `/tmp` or `/tmp2`. For more information, see [Plugin Configuration Basics]({{site.baseurl}}/docs/plugin-configuration-basics).
 
 Note: When you run Drill in distributed mode, verify that the default temporary workspace connection points to a distributed filesystem. If temporary tables are created on the local filesystem, they can only be accessed by the local Drillbit that created the temporary table.
+
 ### Setting the Storage Format
 The default storage format for temporary tables is parquet. However, you can create temporary tables in one of the following formats:
 
@@ -61,7 +64,6 @@ To change the storage format, set the `store.format` option before you create th
 		ALTER SESSION SET `store.format`='json';
 
 ### Creation of Temporary Tables and User Access
-
 In general, the user that creates a temporary table can run queries against the table, as long as the session in which the table was created is active. Although temporary tables are actually directories, you query the temporary table directory as you would query a table.
 
 When you create a temporary table, Drill creates a temporary location named after the session ID to store temporary tables associated with the session. Drill writes temporary table files with masked filenames, such as 0_0_0.parquet, to temporary table directories within the session’s temporary location. Internally, Drill masks each temporary table directory name. Therefore, when you submit a query with the temporary table name, Drill resolves the temporary table name to the internal, mas [...]
@@ -72,24 +74,24 @@ Drill creates the session and temporary table directories and files with the fol
 * Only the owner can create and read the files within the folders.
 
 ### Authorization
-
 When authorization is enabled, the user that started the Drillbit will own the temporary tables created within that session. However, the user that created the table will have the ability to query the data. For example, the *mapr* user starts the Drillbit and *Sally* starts the Drill shell. When authorization is enabled, *Sally* can submit a query to create a temporary table and she can query the temporary table data. However, the *mapr* user is the owner of the temporary table directory [...]
 
 ### Impersonation
-
 When impersonation is enabled, only the user that created the temporary table can query the temporary table. For example, if the Drillbit was started by the *mapr* user and user *Tom* starts the Drill shell and creates a temporary table, *Tom* will have permission to query the temporary table. *Tom* can also access temporary tables that he created in a different session as long as he has the full path to the temporary table directory and the session is still active. However, *Sally* cann [...]
 
 ### Selection of Tables
 
-When you mention a table name in a SELECT statement, any temporary table with that name takes priority over a table with the same name in the current workspace.
+Prior to Drill 1.21 when you mentioned a table name in a SELECT statement, any temporary table with that name would take priority over a table with the same name in the current workspace. Starting in Drill 1.21, temporary tables behave the same way that conventional table names do.
 
-For example, when you issue a SELECT statement on a table name that is common among the default temporary tables workspace and the current workspace, the temporary table is returned:
+For example, when you issue a SELECT statement on a table name that is common among the default temporary tables workspace and the current workspace, the current workspace's table is returned:
 
-       USE dfs.json;
+```sql
+USE dfs.json;
 
-       SELECT * FROM donuts; //returns table from dfs.tmp
+SELECT * FROM donuts; // returns table from dfs.json
 
-       SELECT* FROM dfs.json.donuts; //returns table from dfs.json
+SELECT* FROM dfs.tmp.donuts; //returns temporary table from dfs.tmp
+```
 
 ### Drop a Temporary Table
 


[drill-site] 01/03: Document the Google Sheets storage plugin.

Posted by dz...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

dzamo pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/drill-site.git

commit 98eb717612f370ce4103702378dcb4666cadca7a
Author: James Turton <ja...@somecomputer.xyz>
AuthorDate: Tue Feb 21 10:45:53 2023 +0200

    Document the Google Sheets storage plugin.
---
 .../plugins/170-google-sheets-plugin.md            | 200 +++++++++++++++++++++
 1 file changed, 200 insertions(+)

diff --git a/_docs/en/connect-a-data-source/plugins/170-google-sheets-plugin.md b/_docs/en/connect-a-data-source/plugins/170-google-sheets-plugin.md
new file mode 100644
index 000000000..ec9b30a9f
--- /dev/null
+++ b/_docs/en/connect-a-data-source/plugins/170-google-sheets-plugin.md
@@ -0,0 +1,200 @@
+---
+title: "Google Sheets Storage Plugin"
+slug: "Google Sheets Storage Plugin"
+parent: "Connect a Data Source"
+---
+
+**Introduced in release:** 1.21
+
+This plugin enables you to query and write to Google Sheets.
+
+## Obtaining credentials from Google
+Google Sheets uses OAuth2.0 (may it be quickly deprecated) for authorization. In order to query Google Sheets, you will first need to obtain three artifacts:
+
+* Your `clientID`:  This is an identifier which uniquely identifies your application to Google
+* Your `client_secret`: You can think of this as your password for your application to access Google Sheets
+* Your redirect URL:  This is the URL to which Google will send the various access tokens and which you will need later.  For a local installation of Drill, it will be:
+  `http://localhost:8047/credentials/<plugin name>/update_oauth2_authtoken`.
+
+1. To obtain the `clientID` and `clientSecret` you will need to obtain the Google keys, open the [Google Sheets API](https://console.cloud.google.com/apis/library/sheets.googleapis.com) and click on the `Enable` button.
+2. Once you've enabled the API, you will be taken to the API Manager.  Either select an existing project or create a new one.
+3. Next, navigate to the `Credentials` in the left panel.
+4. Click on `+Create Credentials` at the top of the page.  Select `OAuth client ID` and select `Web Application` or `Desktop` as the type.  Follow the instructions and download
+   the JSON file that Google provides.
+
+Drill does not use the JSON file, but you will be cutting and pasting values from the JSON file into the Drill configuration.
+
+## Configuring Drill
+Create a storage plugin following the normal procedure for doing so.  You can use the example below as a template.  Cut and paste the `clientID` and `client_secret` from the
+JSON file into your Drill configuration as shown below.  Once you've done that, save the configuration.
+
+```json
+{
+  "type": "googlesheets",
+  "allTextMode": true,
+  "extractHeaders": true,
+  "oAuthConfig": {
+    "callbackURL": "http://localhost:8047/credentials/googlesheets/update_oauth2_authtoken",
+    "authorizationURL": "https://accounts.google.com/o/oauth2/auth",
+    "authorizationParams": {
+      "response_type": "code",
+      "scope": "https://www.googleapis.com/auth/spreadsheets https://www.googleapis.com/auth/drive.readonly https://www.googleapis.com/auth/drive.metadata.readonly"
+    }
+  },
+  "credentialsProvider": {
+    "credentialsProviderType": "PlainCredentialsProvider",
+    "credentials": {
+      "clientID": "<YOUR CLIENT ID>",
+      "clientSecret": "<YOUR CLIENT SECRET>",
+      "tokenURI": "https://oauth2.googleapis.com/token"
+    },
+    "userCredentials": {}
+  },
+  "enabled": true,
+  "authMode": "SHARED_USER"
+}
+```
+
+With the exception of the clientID, client_secret and redirects, you should not have to modify any of the other parameters in the configuration.
+
+### Other Configuration Parameters
+There are two configuration parameters which you may want to adjust:
+* `allTextMode`:  This parameter when `true` disables Drill's data type inferencing for your files.  If your data has inconsistent data types, set this to `true`.  Default is
+  `true`.
+* `extractHeaders`:  When `true`, Drill will treat the first row of your data as headers.  When `false` Drill will assign column names like `field_n` for each column.
+
+### Authenticating with Google
+Once you have configured Drill to query Google Sheets, there is one final step before you can access data.  You must authenticate the application (Drill) with Google Sheets.  After you have saved your Google Sheets configuration, navigate back to the configuration screen for your plugin and click on `Authorize`. A new window should appear which will prompt you to authenticate with Google services.  Once you have done that, you should be able to query Google Sheets!  See, that wasn't so hard!
+
+### Authentication Modes
+The Google Sheets plugin supports the `SHARED_USER` and `USER_TRANSLATION` authentication modes. `SHARED_USER` is as the name implies, one user for everyone. `USER_TRANSLATION`
+uses different credentials for each individual user.  In this case, the credentials are the OAuth2.0 access tokens.
+## Querying Data
+Once you have configured Drill to connect to Google Sheets, querying is very straightforward.
+
+### Obtaining the SpreadsheetID
+The URL below is a public spreadsheet hosted on Google Sheets:
+[https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/](https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/)
+
+In this URL, the portion `1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms` is the spreadsheetID. Thus,
+if you wanted to query this sheet in Drill, after configuring Drill, you could do so with the following
+query:
+
+```sql
+SELECT *
+FROM googlesheets.`1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms`.`Class Data`
+```
+
+The format for the `FROM` clause for Google Sheets is:
+```sql
+FROM <plugin name>.<sheet ID>.<tab name>
+```
+Note that you must specify the tab name to successfully query Google Sheets.
+
+### Accessing Tabs by Index
+If you don't know the names of the available tabs in your GoogleSheets document, you can query the sheets by index using the `tab[n]` format.  Indexing starts at zero and every Sheets document must have at least one sheet.  Note that this must be enclosed in backticks.
+
+```sql
+SELECT *
+FROM googlesheets.<sheet id>.`tab[0]`
+```
+
+
+### Metadata
+You can obtain a list of available sheets by querying the `INFORMATION_SCHEMA` as shown below.  Assuming that you have a connection to Google Sheets called `googlesheets`:
+
+```sql
+SELECT *
+FROM `INFORMATION_SCHEMA`.`SCHEMATA`
+WHERE SCHEMA_NAME LIKE 'googlesheets%'
+```
+
+### Implicit Metadata Fields
+GoogleSheets has two implicit metadata fields which are:
+
+* `_sheets`: This will return a list of sheet (tab) names in a given GS document
+* `_title`: You can also access the file name with the `_title` field.  Note that the file name is NOT unique and should not be used for querying data.
+
+Due to rate limits from Google, the tabs are not reported to the `INFORMATION_SCHEMA`.  However, it is possible to obtain a list of all available tabs with the following query:
+
+```sql
+SELECT _sheets
+FROM googlesheets.`<token>`.`<sheet>`
+LIMIT 1
+```
+
+
+### Using Aliases
+Since the sheet IDs from Google are not human readable, one way to make your life easier is to use Drill's aliasing features to provide a better name for the actual sheet name.
+
+### Data Types
+Drill's Google Sheets reader will attempt to infer the data types of the incoming data.  As with other connectors, this is an imperfect process since Google Sheets does not
+supply a schema or other information to allow Drill to identify the data types of a column.  At present, here is how Drill will map your data:
+* Numbers:  All numeric columns will be mapped to `DOUBLE` data types
+* Boolean:  Columns containing `true/false` will be mapped to the `BOOLEAN` type
+* Time, Date, Timestamp:  Temporal fields will be mapped to the correct type.
+* Text:  Anything else will be projected as `VARCHAR`
+
+If the data type inference is not working for you, you can set the `allTextMode` to `true` and Drill will read everything as a `VARCHAR`.
+
+#### Schema Provisioning
+As with other plugins, you can provide a schema inline as shown in the example query below.
+
+```sql
+SELECT *
+FROM table(`googlesheets`.`<your google sheet>`.`MixedSheet`
+    (schema => 'inline=(`Col1` VARCHAR, `Col2` INTEGER, `Col3` VARCHAR)'))
+LIMIT 5
+```
+
+### Column Headers
+When Drill reads Google Sheets, it is assumed that the first row contains column headers.
+If this is incorrect you can set the `extractHeaders` parameter to `false`and Drill will name each field `field_n` where `n` is the column index.
+
+# Writing Data To Google Sheets
+When Drill is connected to GoogleSheets, you can also write data to Google Sheets. The basic procedure is
+the same as with any other data source.  Simply write a `CREATE TABLE AS` (CTAS) query and your data will be
+written to GoogleSheets.
+
+If you use a GoogleSheets filetoken in your CTAS query, Drill will create a new tab in that GoogleSheets document. However, if you use a file name, Drill will create a new GoogleSheets document and then create a new tab within that document.  
+
+```sql
+-- This will add a tab to an existing GoogleSheets Document
+CREATE TABLE googlesheets.`2384r7wuf2934iroeci2390ue2ur3r23948230948`.`tab_name` AS SELECT * FROM data
+
+-- This will create a new GoogleSheets Document with a single tab
+CREATE TABLE googlesheets.`new_doc`.`tab1` AS SELECT * FROM data
+
+```
+
+### Inserting (Appending) to Existing GoogleSheets Tabs
+GoogleSheets also supports inserting (appending) data to existing GoogleSheets tabs.  Syntax is:
+
+```sql
+INSERT INTO googlesheets.`<file_token>`.`<tab name>` SELECT * FROM data
+```
+
+
+### Dropping Tables
+The `DROP TABLE` command will drop a tab from a GoogleSheet document.  If the document only has one tab, the entire document will be deleted.
+
+The format for deleting a table is:
+
+```sql
+DROP TABLE googlesheets.<file_token>.<tab name>
+```
+
+# Possible Future Work
+
+### Auto-Aliasing
+As of Drill 1.20, Drill allows you to create user and public aliases for tables and storage plugins. Since Google Sheets
+requires you to use a non-human readable ID to identify the Sheet.  One possible idea to make the Drill connection to Google Sheets
+much more usable would be to automatically create an alias (either public) automatically mapping the unreadable sheetID to the document title.
+This could be accomplished after the first query or after a CTAS query.
+
+### Additional Pushdowns
+The current implementation supports pushdowns for projection and limit.
+The Google Sheets API is quite complex and incredibly poorly documented. In this author's opinion, it is quite possibly one of the worst APIs he has ever seen.
+In any event, it may be possible to add filter, sort and perhaps other pushdowns.
+The current implementation keeps the logic to push filters down to the batch reader, but does not act on these filters.
+If someone figures out how to add the filter pushdowns and wishes to do so, the query planning logic is all there.