You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@asterixdb.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/10/25 15:43:00 UTC
[jira] [Commented] (ASTERIXDB-3059) EXCLUDE in SelectClause

    [ https://issues.apache.org/jira/browse/ASTERIXDB-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623905#comment-17623905 ] 

ASF subversion and git services commented on ASTERIXDB-3059:
------------------------------------------------------------

Commit 8249944e8d97ee94040434bec1228a3b942e3d70 in asterixdb's branch refs/heads/master from ggalvizo
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=8249944e8d ]

[ASTERIXDB-3059][SQL] Support for SELECT .. EXCLUDE ..

- user model changes: yes
- storage format changes: no
- interface changes: no

details:
- adding syntactic sugar rewrite to specify what fields to
  EXCLUDE in a SELECT clause
- fixing result type computer for OBJECT_REMOVE_FIELDS.
- fixing union type handling for OBJECT_REMOVE_FIELDS.

Change-Id: Ib076d130d89077964396fcbf51602488c9e90682
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/17103
Tested-by: Jenkins <je...@fulliautomatix.ics.uci.edu>
Integration-Tests: Jenkins <je...@fulliautomatix.ics.uci.edu>
Reviewed-by: Glenn Galvizo <gg...@uci.edu>


> EXCLUDE in SelectClause
> -----------------------
>
>                 Key: ASTERIXDB-3059
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-3059
>             Project: Apache AsterixDB
>          Issue Type: New Feature
>          Components: SQL - Translator SQL++
>            Reporter: Glenn Justo Galvizo
>            Assignee: Glenn Justo Galvizo
>            Priority: Major
>         Attachments: Projection.png, SelectClause.png
>
>
> It would be nice to have an option to exclude certain fields when using SELECT *. [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_except] gives users the option to exclude columns with 'SELECT * EXCEPT f1, f2, ...'. The 'EXCEPT' concept seems like a nice addition to our grammar, here are a few examples on how we can extend our SELECT clause to include 'EXCLUDE' (used in lieu of 'EXCEPT' to avoid confusion with the set operation):
>  
> Let's start off with some DDLs and some data:
> {code:sql}
> CREATE TYPE     GenericType AS { _id: uuid };
> CREATE DATASET  Users (GenericType) PRIMARY KEY _id AUTOGENERATED;
> CREATE DATASET  Friends (GenericType) PRIMARY KEY _id AUTOGENERATED;
> INSERT INTO     Users [
>   { "user_id": 1, "name": "Sally",
>     "address": { "zip_code": "92512", "state": "CA" } },
>   { "user_id": 2, "name": "Mark" },
>   { "user_id": 3, "name": "John" },
>   { "user_id": 4, "name": "Tracy", "title": "Dr" }
> ];
> INSERT INTO     Friends [
>   { "user_id": 1, "friend": 2 },
>   { "user_id": 1, "friend": 3 },
>   { "user_id": 2, "friend": 1 },
>   { "user_id": 3, "friend": 1 }
> ];{code}
> ----
>  
> Now we can get into the examples:
> 1. Get me all users U1 that have friends. Do not include the F dataset. In the example below, we have a 'SELECT *' followed by an 'EXCLUDE ( F )'. In contrast to BigQuery, we do not specify columns / fields in our exclusion list here, instead we specify an exclusion of _variables_ that would have otherwise been returned with our 'SELECT' clause.
> {code:sql}
> FROM   Users U1, Friends F
> WHERE  U1.user_id = F.user_id
> SELECT * EXCLUDE F;{code}
> Returns the following (Sally appears twice because she has two friends):
> {code:java}
> > { "U1":  { "user_id": 1, "name": "Sally",
>              "address": { "zip_code": "92512", "state": "CA" } } }
> > { "U1": { "user_id": 1, "name": "Sally", 
>             "address": { "zip_code": "92512", "state": "CA" } } }
> > { "U1": { "user_id": 2, "name": "Mark" } }
> > { "U1": { "user_id": 3, "name": "John" } }{code}
> 2. Get me all unique users U1 that have friends. Do not include that user's address and title. The example below more closely aligns with BigQuery, as we specify columns / fields that we want to exclude from U1 in the exclusion list 'address' and 'title'.
> {code:sql}
> FROM   Users U1, Friends F
> WHERE  U1.user_id = F.user_id 
> SELECT DISTINCT U1.* EXCLUDE address, title;{code}
> Returns the following:
> {code:java}
> > { "user_id": 1, "name": "Sally" }
> > { "user_id": 2, "name": "Mark" }
> > { "user_id": 3, "name": "John" }{code}
> 3. If we want to exclude some field in a nested object, we can specify the fields within our object with the '.' syntax. Say we don't want to include the zipcode of an address object within a user that has friends. We would use the query below:
> {code:sql}
> FROM   Users U1, Friends F
> WHERE  U1.user_id = F.user_id
> SELECT DISTINCT U1.* EXCLUDE address.zip_code;{code}
> Returns the following:
> {code:java}
> > { "user_id": 1, "name": "Sally", "address": { "state": "CA" } }
> > { "user_id": 2, "name": "Mark" }
> > { "user_id": 3, "name": "John" }{code}
> 4. Now suppose we want users and the user documents of their immediate friends, excluding addresses from both user documents. We would use the following query:
> {code:sql}
> FROM   Users U1, Friends F, Users U2
> WHERE  U1.user_id = F.user_id AND
>        U2.user_id = F.friend
> SELECT U1, U2 EXCLUDE U1.address, U2.address;{code}
> Returns the following:
> {code:java}
> > { "U1": { "user_id": 1, "name": "Sally" }, 
>     "U2": { "user_id": 2, "name": "Mark" } }
> > { "U1": { "user_id": 1, "name": "Sally" }, 
>     "U2": { "user_id": 3, "name": "John" } }
> > { "U1": { "user_id": 2, "name": "Mark" }, 
>     "U2": { "user_id": 1, "name": "Sally" } }
> > { "U1": { "user_id": 3, "name": "John" }, 
>     "U2": { "user_id": 1, "name": "Sally" } }{code}
> 5. To expand on what the EXCLUDE sees, our EXCLUDE sees what our SELECT clause returns. In the case of SELECT DISTINCT however ..., our DISTINCT applies after our EXCLUDE. For example, suppose we wanted to remove a field from the key of our Friends document:
> {code:sql}
> FROM   Friends F
> SELECT DISTINCT F EXCLUDE F.friend;{code}
> Returns the following:
> {code:java}
> > { "user_id": 1 }
> > { "user_id": 2 }
> > { "user_id": 3 }{code}
> Note that there are four unique Friends document, but exclusion of the friend field generates two user_id = 1 documents which the DISTINCT subsequently prunes.
> 6. We support a special use-case for single collection FROM-CLAUSEs that goes against the "EXCLUDE sees what the SELECT returns" semantics, where the items in the EXCLUDE list will be qualified with the single collection's variable. This rule fires with the following conditions:
>  * We have a SELECT * (we rule out SELECT var and SELECT var.*).
>  * We have a single FROM-TERM in our FROM-CLAUSE that contains a single collection (no JOINs, UNNESTs, NESTs, or GROUP-BYs).
> We have this special case to more closely align with how names are resolved in single-collection queries. For example:
> {code:sql}
> FROM   Users 
> WHERE  name = 'Sally'
> SELECT * EXCLUDE Users.name, address.zip_code, title;{code}
> is expanded to:
> {code:sql}
> FROM   Users AS Users
> WHERE  Users.name = 'Sally'
> SELECT * EXCLUDE Users.name, Users.address.zip_code, Users.title;{code}
> Note that the "Users.name" does not expand to "Users.Users.name" because the "Users" in "Users.name" serves as the connector / anchor. In "address.zip_code" and "title", there exists no such connectors / anchor so we qualify our path with the dataset variable "Users".
> ----
>  
> Now let's see some "problematic" queries:
> 1. U3 is not defined anywhere in the query, but the EXCLUDE sees U3 as a potential field from the SELECT clause. For now we are going to execute our query without warnings, but it would be nice to raise a warning here in the future (otherwise, we would have to implement some form of type analysis into the lang module).
> {code:sql}
> FROM   Users U1, Friends F
> WHERE  U1.user_id = F.user_id
> SELECT U1 EXCLUDE U3;{code}
> 2. Similar to 1, our EXCLUDE clause is extraneous. F is not visible (using EXCLUDE sees what the SELECT returns semantics). We are going to execute our query without warnings, but it would be nice to raise a warning here in the future:
> {code:sql}
> FROM   Users U1, Friends F
> WHERE  U1.user_id = F.user_id
> SELECT U1 EXCLUDE F;{code}
> 3. Looking at our data, we do not have any documents that have a ssn field. But, we (currently) cannot infer the absence of an ssn field, so the query below should run without warnings / errors.
> {code:sql}
> FROM   Users U1, Friends F
> WHERE  U1.user_id = F.user_id
> SELECT U1 EXCLUDE U1.ssn;{code}
> 4. (Raised by Don Chamberlin): In all of the previous examples, we could connect each field / nested field in the EXCLUDE list using the first identifier in the field / nested field itself to some identifier in the SELECT list. With SELECT VALUE, we have no such identifier. Hence, we do not allow SELECT VALUE and EXCLUDE to exist together to make it clearer what each field in our EXCLUDE list refers to.
> {code:sql}
> FROM   Users U1
> SELECT VALUE U1.address EXCLUDE zip_code;{code}
> ----
>  
> The exact grammar I have in mind is given below, where we simply add the 'EXCLUDE' at the tail of our 'SELECT' clause. Semantically, the 'EXCLUDE' clause surrounds our SELECT-EXPR with a call to OBJECT_REMOVE_FIELDS. Using #2 as an example, a logical equivalent would be:
> {code:sql}
> FROM   (
>   FROM   Users U1, Friends F
>   WHERE  U1.user_id = F.user_id
>   SELECT U1.* // DISTINCT gets pushed to outer query.
> ) TMP
> SELECT DISTINCT VALUE OBJECT_REMOVE_FIELDS(TMP, ["address", "title"]);{code}
>  
> Glancing at the grammar, one may notice that our EXCLUDE list does not include generalized path expressions (i.e. no array access). Suppose a user had an array of phone numbers documents. With generalized path expressions, one could create a query like:
> {code:sql}
> FROM    Users U1 
> SELECT  U1.phone_numbers EXCLUDE phone_numbers[1], 
>                                  phone_numbers[0].kind,
>                                  phone_numbers[2:4];{code}
> While such support would be nice to add in the future, adding array access requires an entirely new function runtime to be made. We can currently get away with just using an existing function (OBJECT_REMOVE_FIELDS) to support fields and fields nested in objects. We can revisit array access in another issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)