You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "vtlim (via GitHub)" <gi...@apache.org> on 2023/03/29 22:52:28 UTC

[GitHub] [druid] vtlim commented on a diff in pull request #13736: docs: sql unnest and cleanup unnest datasource

vtlim commented on code in PR #13736:
URL: https://github.com/apache/druid/pull/13736#discussion_r1152555855


##########
docs/querying/sql.md:
##########
@@ -82,6 +83,43 @@ FROM clause, metadata tables are not considered datasources. They exist only in
 For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
 documentation.
 
+## UNNEST
+
+> The UNNEST SQL function is [experimental](../development/experimental.md). Their API and behavior are subject
+> to change in future releases. It is not recommended to use this feature in production at this time.
+
+The UNNEST clause unnests array values. It's the SQL equivalent to the [unnest datasource](./datasource.md#unnest). The source for UNNEST can be an array or an input that's been transformed into an array, such as with helper functions like MV_TO_ARRAY or ARRAY.
+
+The following is the general syntax for UNNEST, specifically a query that returns the column that gets unnested:
+
+```sql
+SELECT column_alias_name FROM datasource, UNNEST(source_expression1) AS table_alias_name1(column_alias_name1), UNNEST(source_expression2) AS table_alias_name2(column_alias_name2), ...
+```
+
+* The `datasource` for UNNEST can be any Druid datasource, such as the following:
+  * A table, such as  `FROM a_table`.
+  * A subset of a table based on a query, a filter, or a JOIN. For example, `FROM (SELECT columnA,columnB,columnC from a_table)`.
+* The `source_expression` for the UNNEST function must be an array and can come from any expression. If the dimension you are unnesting is a multi-value dimension, you have to specify `MV_TO_ARRAY(dimension)` to convert it to an implicit ARRAY type. You can also specify any expression that has an SQL array datatype. For example, you can call UNNEST on the following:
+  * `ARRAY[dim1,dim2]` if you want to make an array out of two dimensions. 
+  * `ARRAY_CONCAT(dim1,dim2)` if you have to concatenate two multi-value dimensions. 

Review Comment:
   ```suggestion
     * `ARRAY_CONCAT(dim1,dim2)` if you want to concatenate two multi-value dimensions. 
   ```



##########
docs/querying/sql.md:
##########
@@ -82,6 +83,43 @@ FROM clause, metadata tables are not considered datasources. They exist only in
 For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
 documentation.
 
+## UNNEST
+
+> The UNNEST SQL function is [experimental](../development/experimental.md). Their API and behavior are subject

Review Comment:
   ```suggestion
   > The UNNEST SQL function is [experimental](../development/experimental.md). Its API and behavior are subject
   ```



##########
docs/querying/sql.md:
##########
@@ -82,6 +83,43 @@ FROM clause, metadata tables are not considered datasources. They exist only in
 For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
 documentation.
 
+## UNNEST
+
+> The UNNEST SQL function is [experimental](../development/experimental.md). Their API and behavior are subject
+> to change in future releases. It is not recommended to use this feature in production at this time.
+
+The UNNEST clause unnests array values. It's the SQL equivalent to the [unnest datasource](./datasource.md#unnest). The source for UNNEST can be an array or an input that's been transformed into an array, such as with helper functions like MV_TO_ARRAY or ARRAY.
+
+The following is the general syntax for UNNEST, specifically a query that returns the column that gets unnested:
+
+```sql
+SELECT column_alias_name FROM datasource, UNNEST(source_expression1) AS table_alias_name1(column_alias_name1), UNNEST(source_expression2) AS table_alias_name2(column_alias_name2), ...
+```
+
+* The `datasource` for UNNEST can be any Druid datasource, such as the following:
+  * A table, such as  `FROM a_table`.
+  * A subset of a table based on a query, a filter, or a JOIN. For example, `FROM (SELECT columnA,columnB,columnC from a_table)`.
+* The `source_expression` for the UNNEST function must be an array and can come from any expression. If the dimension you are unnesting is a multi-value dimension, you have to specify `MV_TO_ARRAY(dimension)` to convert it to an implicit ARRAY type. You can also specify any expression that has an SQL array datatype. For example, you can call UNNEST on the following:
+  * `ARRAY[dim1,dim2]` if you want to make an array out of two dimensions. 
+  * `ARRAY_CONCAT(dim1,dim2)` if you have to concatenate two multi-value dimensions. 
+* The `AS table_alias_name(column_alias_name)` clause  is not required but is highly recommended. Use it to specify the output, which can be an existing column or a new one. Replace `table_alias_name` and `column_alias_name` with a table and column name you want to alias the unnested results to. If you don't provide this, Druid uses a nondescriptive name, such as `EXPR$0`.
+
+Keep these two things in mind when writing your query:
+
+- You must include the context parameter `"enableUnnest": true`.
+- You can unnest multiple source expressions in a single query.
+- Notice the comma between the datasource and the UNNEST function. This is needed in most cases of the UNNEST function. Specifically, it is not needed when you're unnesting an inline array since the array itself is the datasource.
+- If you view the native explanation of a SQL UNNEST, you'll notice that Druid uses `j0.unnest` as a virtual column to perform the unnest. An underscore is added for each unnest, so you may notice virtual columns named `_j0.unnest` or `__j0.unnest`.
+
+For examples, see the [Unnest arrays tutorial](../tutorials/tutorial-unnest-arrays.md).
+
+The UNNEST function has the following limitations:
+
+- The function does not remove any duplicates or nulls in an array. Nulls will be treated as any other value in an array. If there are multiple nulls within the array, a record corresponding to each of the nulls gets created.
+- Arrays inside complex JSON types are not supported.
+- You cannot perform an UNNEST at ingestion time, including SQL-based ingestion using the MSQ task engine.
+- UNNEST preserves the ordering in the source array that is being unnested.

Review Comment:
   Is this a limitation or rather just a description of how UNNEST functions?



##########
docs/tutorials/tutorial-unnest-arrays.md:
##########
@@ -141,12 +145,128 @@ Now that the data is loaded, run the following query:
 SELECT * FROM nested_data
 ```
 
-In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries, such as Scan. 
+In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries against the unnested records. Depending on the type of queries you write, see either [Unnest using SQL queries](#unnest-using-sql-queries) or [Unnest using native queries](#unnest-using-native-queries).
 
-## Unnest a single column
+## Unnest using SQL queries
+
+The following is the general syntax for UNNEST:
+
+```sql
+SELECT column_alias_name FROM datasource, UNNEST(source_expression) AS table_alias_name(column_alias_name)
+```
+
+In addition, you must supply the following context parameter:
+
+```json
+"enableUnnest": "true"
+```
+
+For more information about the syntax, see [UNNEST](../querying/sql.md#unnest).
+
+### Unnest a single source expression in a datasource
+
+The following query returns a column called `d3` from the table `nested_data`. `d3` contains the unnested values from the source column `dim3`:
+
+```sql
+SELECT d3 FROM "nested_data", UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3) 
+```
+
+Notice the MV_TO_ARRAY helper function, which converts the multi-value records in `dim3` to arrays. It is required since `dim3` is a multi-value string dimension. 
+
+If the column you are unnesting is not a string dimension, then you do not need to use the MV_TO_ARRAY helper function.
+
+### Unnest a virtual column
+
+You can unnest into a virtual column (multiple columns treated as one). The following query returns the two source columns and a third virtual column containing the unnested data:
+
+```sql
+SELECT dim4,dim5,d45 FROM nested_data, UNNEST(ARRAY[dim4,dim5]) AS example_table(d45)
+```
+
+This virtual column is the product of the two source columns. Notice how the total number of rows has grown. The table `nested_data` had only seven rows originally.
+
+Another way to unnest a virtual column is to concatenate them with ARRAY_CONCAT:
+
+```sql
+SELECT dim4,dim5,d45 FROM nested_data, UNNEST(ARRAY_CONCAT(dim4,dim5)) AS example_table(d45)
+```
+
+Decide which method to use based on what your goals are. 
+
+### Unnest multiple source expressions
+
+You can include multiple UNNEST clauses in a single query. Each `UNNEST` clause needs the following:
+
+```sql
+UNNEST(source_expression) AS table_alias_name(column_alias_name)
+```
+
+The `table_alias_name` and `column_alias_name` for each UNNEST clause should be unique.
+
+The example query returns the following from  the `nested_data` datasource:
+
+- the source columns `dim3`, `dim4`, and `dim5`
+- an unnested version of `dim3` aliased to `d3`
+- an unnested virtual column composed of `dim4` and `dim5` aliased to `d45`
+
+```sql
+SELECT dim3,dim4,dim5,d3,d45 FROM "nested_data", UNNEST(MV_TO_ARRAY("dim3")) AS ud(d3), UNNEST(ARRAY[dim4,dim5]) AS foo(d45)

Review Comment:
   ```suggestion
   SELECT dim3,dim4,dim5,d3,d45 FROM "nested_data", UNNEST(MV_TO_ARRAY("dim3")) AS foo1(d3), UNNEST(ARRAY[dim4,dim5]) AS foo2(d45)
   ```



##########
docs/tutorials/tutorial-unnest-arrays.md:
##########
@@ -141,12 +145,128 @@ Now that the data is loaded, run the following query:
 SELECT * FROM nested_data
 ```
 
-In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries, such as Scan. 
+In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries against the unnested records. Depending on the type of queries you write, see either [Unnest using SQL queries](#unnest-using-sql-queries) or [Unnest using native queries](#unnest-using-native-queries).
 
-## Unnest a single column
+## Unnest using SQL queries
+
+The following is the general syntax for UNNEST:
+
+```sql
+SELECT column_alias_name FROM datasource, UNNEST(source_expression) AS table_alias_name(column_alias_name)
+```
+
+In addition, you must supply the following context parameter:
+
+```json
+"enableUnnest": "true"
+```
+
+For more information about the syntax, see [UNNEST](../querying/sql.md#unnest).
+
+### Unnest a single source expression in a datasource
+
+The following query returns a column called `d3` from the table `nested_data`. `d3` contains the unnested values from the source column `dim3`:
+
+```sql
+SELECT d3 FROM "nested_data", UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3) 
+```
+
+Notice the MV_TO_ARRAY helper function, which converts the multi-value records in `dim3` to arrays. It is required since `dim3` is a multi-value string dimension. 
+
+If the column you are unnesting is not a string dimension, then you do not need to use the MV_TO_ARRAY helper function.
+
+### Unnest a virtual column
+
+You can unnest into a virtual column (multiple columns treated as one). The following query returns the two source columns and a third virtual column containing the unnested data:
+
+```sql
+SELECT dim4,dim5,d45 FROM nested_data, UNNEST(ARRAY[dim4,dim5]) AS example_table(d45)
+```
+
+This virtual column is the product of the two source columns. Notice how the total number of rows has grown. The table `nested_data` had only seven rows originally.
+
+Another way to unnest a virtual column is to concatenate them with ARRAY_CONCAT:
+
+```sql
+SELECT dim4,dim5,d45 FROM nested_data, UNNEST(ARRAY_CONCAT(dim4,dim5)) AS example_table(d45)
+```
+
+Decide which method to use based on what your goals are. 
+
+### Unnest multiple source expressions
+
+You can include multiple UNNEST clauses in a single query. Each `UNNEST` clause needs the following:
+
+```sql
+UNNEST(source_expression) AS table_alias_name(column_alias_name)
+```
+
+The `table_alias_name` and `column_alias_name` for each UNNEST clause should be unique.
+
+The example query returns the following from  the `nested_data` datasource:
+
+- the source columns `dim3`, `dim4`, and `dim5`
+- an unnested version of `dim3` aliased to `d3`
+- an unnested virtual column composed of `dim4` and `dim5` aliased to `d45`
+
+```sql
+SELECT dim3,dim4,dim5,d3,d45 FROM "nested_data", UNNEST(MV_TO_ARRAY("dim3")) AS ud(d3), UNNEST(ARRAY[dim4,dim5]) AS foo(d45)
+```
+
+
+### Unnest a column from a subset of a table
+
+The following query uses only three columns from the `nested_data` table as the datasource. From that subset, it unnests the column `dim3` into `d3` and returns `d3`.
+
+```sql
+SELECT d3 FROM (select dim1, dim2, dim3 from "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3)

Review Comment:
   ```suggestion
   SELECT d3 FROM (SELECT dim1, dim2, dim3 FROM "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3)
   ```



##########
docs/tutorials/tutorial-unnest-arrays.md:
##########
@@ -141,12 +145,128 @@ Now that the data is loaded, run the following query:
 SELECT * FROM nested_data
 ```
 
-In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries, such as Scan. 
+In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries against the unnested records. Depending on the type of queries you write, see either [Unnest using SQL queries](#unnest-using-sql-queries) or [Unnest using native queries](#unnest-using-native-queries).
 
-## Unnest a single column
+## Unnest using SQL queries
+
+The following is the general syntax for UNNEST:
+
+```sql
+SELECT column_alias_name FROM datasource, UNNEST(source_expression) AS table_alias_name(column_alias_name)
+```
+
+In addition, you must supply the following context parameter:
+
+```json
+"enableUnnest": "true"
+```
+
+For more information about the syntax, see [UNNEST](../querying/sql.md#unnest).
+
+### Unnest a single source expression in a datasource
+
+The following query returns a column called `d3` from the table `nested_data`. `d3` contains the unnested values from the source column `dim3`:
+
+```sql
+SELECT d3 FROM "nested_data", UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3) 
+```
+
+Notice the MV_TO_ARRAY helper function, which converts the multi-value records in `dim3` to arrays. It is required since `dim3` is a multi-value string dimension. 
+
+If the column you are unnesting is not a string dimension, then you do not need to use the MV_TO_ARRAY helper function.
+
+### Unnest a virtual column
+
+You can unnest into a virtual column (multiple columns treated as one). The following query returns the two source columns and a third virtual column containing the unnested data:
+
+```sql
+SELECT dim4,dim5,d45 FROM nested_data, UNNEST(ARRAY[dim4,dim5]) AS example_table(d45)
+```
+
+This virtual column is the product of the two source columns. Notice how the total number of rows has grown. The table `nested_data` had only seven rows originally.
+
+Another way to unnest a virtual column is to concatenate them with ARRAY_CONCAT:
+
+```sql
+SELECT dim4,dim5,d45 FROM nested_data, UNNEST(ARRAY_CONCAT(dim4,dim5)) AS example_table(d45)
+```
+
+Decide which method to use based on what your goals are. 
+
+### Unnest multiple source expressions
+
+You can include multiple UNNEST clauses in a single query. Each `UNNEST` clause needs the following:
+
+```sql
+UNNEST(source_expression) AS table_alias_name(column_alias_name)
+```
+
+The `table_alias_name` and `column_alias_name` for each UNNEST clause should be unique.
+
+The example query returns the following from  the `nested_data` datasource:
+
+- the source columns `dim3`, `dim4`, and `dim5`
+- an unnested version of `dim3` aliased to `d3`
+- an unnested virtual column composed of `dim4` and `dim5` aliased to `d45`
+
+```sql
+SELECT dim3,dim4,dim5,d3,d45 FROM "nested_data", UNNEST(MV_TO_ARRAY("dim3")) AS ud(d3), UNNEST(ARRAY[dim4,dim5]) AS foo(d45)
+```
+
+
+### Unnest a column from a subset of a table
+
+The following query uses only three columns from the `nested_data` table as the datasource. From that subset, it unnests the column `dim3` into `d3` and returns `d3`.
+
+```sql
+SELECT d3 FROM (select dim1, dim2, dim3 from "nested_data"), UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3)
+```
+
+### Unnest with a filter
+
+You can specify which rows to unnest by including a filter in your query. The following query:
+
+* Filters the source expression based on `dim2`
+* Unnests the records in `dim3` into `d3` 
+* Returns the records for  the unnested `d3` that have a `dim2` record that matches the filter
+
+```sql
+SELECT d3 FROM (SELECT * FROM nested_data WHERE dim2 IN ('abc')), UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3)
+```
+
+You can also filter the results of an UNNEST clause. The following example unnests the inline array `[1,2,3]` but only returns the rows that match the filter:
+
+```sql
+SELECT * FROM UNNEST(ARRAY[1,2,3]) AS ud(d1) WHERE d1 IN ('1','2')

Review Comment:
   ```suggestion
   SELECT * FROM UNNEST(ARRAY[1,2,3]) AS example_table(d1) WHERE d1 IN ('1','2')
   ```
   Matches the example above



##########
docs/querying/sql.md:
##########
@@ -82,6 +83,43 @@ FROM clause, metadata tables are not considered datasources. They exist only in
 For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
 documentation.
 
+## UNNEST
+
+> The UNNEST SQL function is [experimental](../development/experimental.md). Their API and behavior are subject
+> to change in future releases. It is not recommended to use this feature in production at this time.
+
+The UNNEST clause unnests array values. It's the SQL equivalent to the [unnest datasource](./datasource.md#unnest). The source for UNNEST can be an array or an input that's been transformed into an array, such as with helper functions like MV_TO_ARRAY or ARRAY.
+
+The following is the general syntax for UNNEST, specifically a query that returns the column that gets unnested:
+
+```sql
+SELECT column_alias_name FROM datasource, UNNEST(source_expression1) AS table_alias_name1(column_alias_name1), UNNEST(source_expression2) AS table_alias_name2(column_alias_name2), ...
+```
+
+* The `datasource` for UNNEST can be any Druid datasource, such as the following:
+  * A table, such as  `FROM a_table`.
+  * A subset of a table based on a query, a filter, or a JOIN. For example, `FROM (SELECT columnA,columnB,columnC from a_table)`.
+* The `source_expression` for the UNNEST function must be an array and can come from any expression. If the dimension you are unnesting is a multi-value dimension, you have to specify `MV_TO_ARRAY(dimension)` to convert it to an implicit ARRAY type. You can also specify any expression that has an SQL array datatype. For example, you can call UNNEST on the following:
+  * `ARRAY[dim1,dim2]` if you want to make an array out of two dimensions. 
+  * `ARRAY_CONCAT(dim1,dim2)` if you have to concatenate two multi-value dimensions. 
+* The `AS table_alias_name(column_alias_name)` clause  is not required but is highly recommended. Use it to specify the output, which can be an existing column or a new one. Replace `table_alias_name` and `column_alias_name` with a table and column name you want to alias the unnested results to. If you don't provide this, Druid uses a nondescriptive name, such as `EXPR$0`.
+
+Keep these two things in mind when writing your query:

Review Comment:
   ```suggestion
   Keep the following things in mind when writing your query:
   ```



##########
docs/tutorials/tutorial-unnest-arrays.md:
##########
@@ -141,12 +145,128 @@ Now that the data is loaded, run the following query:
 SELECT * FROM nested_data
 ```
 
-In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries, such as Scan. 
+In the results, notice that the column named `dim3` has nested values like `["a","b"]`.  The example queries that follow unnest `dim3`  and run queries against the unnested records. Depending on the type of queries you write, see either [Unnest using SQL queries](#unnest-using-sql-queries) or [Unnest using native queries](#unnest-using-native-queries).
 
-## Unnest a single column
+## Unnest using SQL queries
+
+The following is the general syntax for UNNEST:
+
+```sql
+SELECT column_alias_name FROM datasource, UNNEST(source_expression) AS table_alias_name(column_alias_name)
+```
+
+In addition, you must supply the following context parameter:
+
+```json
+"enableUnnest": "true"
+```
+
+For more information about the syntax, see [UNNEST](../querying/sql.md#unnest).
+
+### Unnest a single source expression in a datasource
+
+The following query returns a column called `d3` from the table `nested_data`. `d3` contains the unnested values from the source column `dim3`:
+
+```sql
+SELECT d3 FROM "nested_data", UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3) 
+```
+
+Notice the MV_TO_ARRAY helper function, which converts the multi-value records in `dim3` to arrays. It is required since `dim3` is a multi-value string dimension. 
+
+If the column you are unnesting is not a string dimension, then you do not need to use the MV_TO_ARRAY helper function.
+
+### Unnest a virtual column
+
+You can unnest into a virtual column (multiple columns treated as one). The following query returns the two source columns and a third virtual column containing the unnested data:
+
+```sql
+SELECT dim4,dim5,d45 FROM nested_data, UNNEST(ARRAY[dim4,dim5]) AS example_table(d45)
+```
+
+This virtual column is the product of the two source columns. Notice how the total number of rows has grown. The table `nested_data` had only seven rows originally.

Review Comment:
   ```suggestion
   The virtual column `d45` is the product of the two source columns. Notice how the total number of rows has grown. The table `nested_data` had only seven rows originally.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org