Posted to commits@druid.apache.org by jo...@apache.org on 2019/07/03 15:22:45 UTC

[incubator-druid] branch master updated: add SQL docs for multi-value string dimensions (#8011)

This is an automated email from the ASF dual-hosted git repository.

jonwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-druid.git


The following commit(s) were added to refs/heads/master by this push:
     new 3b84246  add SQL docs for multi-value string dimensions (#8011)
3b84246 is described below

commit 3b84246cd6bbc799915c53fda2659b70ea227eb8
Author: Clint Wylie <cw...@apache.org>
AuthorDate: Wed Jul 3 08:22:33 2019 -0700

    add SQL docs for multi-value string dimensions (#8011)
    
    * add SQL docs for multi-value string dimensions
    
    * formatting consistency
    
    * fix typo
    
    * adjust
---
 docs/content/misc/math-expr.md                  |  59 ++++++-----
 docs/content/querying/multi-value-dimensions.md |  28 ++++--
 docs/content/querying/sql.md                    | 126 +++++++++++++++---------
 3 files changed, 126 insertions(+), 87 deletions(-)

diff --git a/docs/content/misc/math-expr.md b/docs/content/misc/math-expr.md
index 2f01069..3fab03a 100644
--- a/docs/content/misc/math-expr.md
+++ b/docs/content/misc/math-expr.md
@@ -54,14 +54,14 @@ begin with a digit. To escape other special characters, you can quote it with do
 For logical operators, a number is true if and only if it is positive (0 or negative value means false). For string
 type, it's the evaluation result of 'Boolean.valueOf(string)'.
 
-Multi-value string dimensions are supported and may be treated as either scalar or array typed values. When treated as
-a scalar type, an expression will automatically be transformed to apply the scalar operation across all values of the
-multi-valued type, to mimic Druid's native behavior. Values that result in arrays will be coerced back into the native
-Druid string type for aggregation. Druid aggregations on multi-value string dimensions on the individual values, _not_
-the 'array', behaving similar to the `unnest` operator available in many SQL dialects. However, by using the
-`array_to_string` function, aggregations may be done on a stringified version of the complete array, allowing the
-complete row to be preserved. Using `string_to_array` in an expression post-aggregator, allows transforming the
-stringified dimension back into the true native array type.
+[Multi-value string dimensions](../querying/multi-value-dimensions.html) are supported and may be treated as either
+scalar or array typed values. When treated as a scalar type, an expression will automatically be transformed to apply
+the scalar operation across all values of the multi-valued type, to mimic Druid's native behavior. Values that result in
+arrays will be coerced back into the native Druid string type for aggregation. Druid aggregations on multi-value string
+dimensions operate on the individual values, _not_ the 'array', behaving similarly to the `UNNEST` operator in many SQL
+dialects. However, by using the `array_to_string` function, aggregations may be done on a stringified version of the
+complete array, allowing the complete row to be preserved. Using `string_to_array` in an expression post-aggregator,
+allows transforming the stringified dimension back into the true native array type.
 
 
 The following built-in functions are available.
@@ -168,31 +168,30 @@ See javadoc of java.lang.Math for detailed explanation for each function.
 
 | function | description |
 | --- | --- |
-| `array(expr1,expr ...)` | constructs an array from the expression arguments, using the type of the first argument as the output array type |
-| `array_length(arr)` | returns length of array expression |
-| `array_offset(arr,long)` | returns the array element at the 0 based index supplied, or null for an out of range index|
-| `array_ordinal(arr,long)` | returns the array element at the 1 based index supplied, or null for an out of range index |
-| `array_contains(arr,expr)` | returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0 |
-| `array_overlap(arr1,arr2)` | returns 1 if arr1 and arr2 have any elements in common, else 0 |
-| `array_offset_of(arr,expr)` | returns the 0 based index of the first occurrence of expr in the array, or `null` if no matching elements exist in the array. |
-| `array_ordinal_of(arr,expr)` | returns the 1 based index of the first occurrence of expr in the array, or `null` if no matching elements exist in the array. |
-| `array_prepend(expr,arr)` | adds expr to arr at the beginning, the resulting array type determined by the type of the array |
-| `array_append(arr1,expr)` | appends expr to arr, the resulting array type determined by the type of the first array |
-| `array_concat(arr1,arr2)` | concatenates 2 arrays, the resulting array type determined by the type of the first array |
-| `array_slice(arr,start,end)` | return the subarray of arr from the 0 based index start(inclusive) to end(exclusive), or `null`, if start is less than 0, greater than length of arr or less than end|
-| `array_to_string(arr,str)` | joins all elements of arr by the delimiter specified by str |
-| `string_to_array(str1,str2)` | splits str1 into an array on the delimiter specified by str2 |
-
+| array(expr1,expr ...) | constructs an array from the expression arguments, using the type of the first argument as the output array type |
+| array_length(arr) | returns length of array expression |
+| array_offset(arr,long) | returns the array element at the 0 based index supplied, or null for an out of range index|
+| array_ordinal(arr,long) | returns the array element at the 1 based index supplied, or null for an out of range index |
+| array_contains(arr,expr) | returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0 |
+| array_overlap(arr1,arr2) | returns 1 if arr1 and arr2 have any elements in common, else 0 |
+| array_offset_of(arr,expr) | returns the 0 based index of the first occurrence of expr in the array, or `-1` (or `null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. |
+| array_ordinal_of(arr,expr) | returns the 1 based index of the first occurrence of expr in the array, or `-1` (or `null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. |
+| array_prepend(expr,arr) | adds expr to arr at the beginning, the resulting array type determined by the type of the array |
+| array_append(arr1,expr) | appends expr to arr, the resulting array type determined by the type of the first array |
+| array_concat(arr1,arr2) | concatenates 2 arrays, the resulting array type determined by the type of the first array |
+| array_slice(arr,start,end) | return the subarray of arr from the 0 based index start(inclusive) to end(exclusive), or `null`, if start is less than 0, greater than length of arr, or greater than end|
+| array_to_string(arr,str) | joins all elements of arr by the delimiter specified by str |
+| string_to_array(str1,str2) | splits str1 into an array on the delimiter specified by str2 |
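The 0 based vs 1 based indexing and the slicing rule in the table above can be sketched in Python (illustrative only; these helpers mimic the documented semantics and are not Druid APIs, with `None` standing in for Druid's `null`):

```python
def array_offset(arr, index):
    """array_offset(arr,long): element at a 0 based index, null (None) if out of range."""
    return arr[index] if 0 <= index < len(arr) else None

def array_ordinal(arr, index):
    """array_ordinal(arr,long): element at a 1 based index, null (None) if out of range."""
    return arr[index - 1] if 1 <= index <= len(arr) else None

def array_slice(arr, start, end):
    """array_slice(arr,start,end): subarray from start (inclusive) to end (exclusive),
    null (None) for an invalid range."""
    if start < 0 or start > len(arr) or start > end:
        return None
    return arr[start:end]

tags = ["t1", "t2", "t3"]
print(array_offset(tags, 0))    # t1
print(array_ordinal(tags, 1))   # t1 -- same element, 1 based
print(array_slice(tags, 1, 3))  # ['t2', 't3']
print(array_offset(tags, 9))    # None (out of range)
```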
 
 
 ## Apply Functions
 
 | function | description |
 | --- | --- |
-| `map(lambda,arr)` | applies a transform specified by a single argument lambda expression to all elements of arr, returning a new array |
-| `cartesian_map(lambda,arr1,arr2,...)` | applies a transform specified by a multi argument lambda expression to all elements of the cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same |
-| `filter(lambda,arr)` | filters arr by a single argument lambda, returning a new array with all matching elements, or null if no elements match |
-| `fold(lambda,arr)` | folds a 2 argument lambda across arr. The first argument of the lambda is the array element and the second the accumulator, returning a single accumulated value. |
-| `cartesian_fold(lambda,arr1,arr2,...)` | folds a multi argument lambda across the cartesian product of all input arrays. The first arguments of the lambda is the array element and the last is the accumulator, returning a single accumulated value. |
-| `any(lambda,arr)` | returns 1 if any element in the array matches the lambda expression, else 0 |
-| `all(lambda,arr)` | returns 1 if all elements in the array matches the lambda expression, else 0 |
+| map(lambda,arr) | applies a transform specified by a single argument lambda expression to all elements of arr, returning a new array |
+| cartesian_map(lambda,arr1,arr2,...) | applies a transform specified by a multi argument lambda expression to all elements of the cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same |
+| filter(lambda,arr) | filters arr by a single argument lambda, returning a new array with all matching elements, or null if no elements match |
+| fold(lambda,arr) | folds a 2 argument lambda across arr. The first argument of the lambda is the array element and the second the accumulator, returning a single accumulated value. |
+| cartesian_fold(lambda,arr1,arr2,...) | folds a multi argument lambda across the cartesian product of all input arrays. The first arguments of the lambda is the array element and the last is the accumulator, returning a single accumulated value. |
+| any(lambda,arr) | returns 1 if any element in the array matches the lambda expression, else 0 |
+| all(lambda,arr) | returns 1 if all elements in the array matches the lambda expression, else 0 |
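The apply functions can likewise be sketched in Python (illustrative only; an explicit initial accumulator argument is used for `fold` here, which is a simplification of the documented signature):

```python
from itertools import product

def map_fn(lam, arr):
    """map(lambda,arr): apply a single-argument lambda to every element."""
    return [lam(x) for x in arr]

def filter_fn(lam, arr):
    """filter(lambda,arr): keep matching elements; null (None) if none match."""
    kept = [x for x in arr if lam(x)]
    return kept if kept else None

def fold_fn(lam, arr, acc):
    """fold(lambda,arr): a two-argument lambda over (element, accumulator)."""
    for x in arr:
        acc = lam(x, acc)
    return acc

def cartesian_map_fn(lam, *arrays):
    """cartesian_map(lambda,arr1,arr2,...): lambda over the cartesian product."""
    return [lam(*combo) for combo in product(*arrays)]

print(map_fn(lambda x: x * 2, [1, 2, 3]))                       # [2, 4, 6]
print(fold_fn(lambda x, acc: x + acc, [1, 2, 3], 0))            # 6
print(cartesian_map_fn(lambda a, b: a + b, ["x"], ["1", "2"]))  # ['x1', 'x2']
```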
diff --git a/docs/content/querying/multi-value-dimensions.md b/docs/content/querying/multi-value-dimensions.md
index 04c7357..ce29451 100644
--- a/docs/content/querying/multi-value-dimensions.md
+++ b/docs/content/querying/multi-value-dimensions.md
@@ -24,12 +24,15 @@ title: "Multi-value dimensions"
 
 # Multi-value dimensions
 
-Apache Druid (incubating) supports "multi-value" string dimensions. These are generated when an input field contains an array of values
-instead of a single value (e.e. JSON arrays, or a TSV field containing one or more `listDelimiter` characters).
+Apache Druid (incubating) supports "multi-value" string dimensions. These are generated when an input field contains an
+array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
+characters).
 
 This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
 are used as a dimension being grouped by. See the section on multi-value columns in
-[segments](../design/segments.html#multi-value-columns) for internal representation details.
+[segments](../design/segments.html#multi-value-columns) for internal representation details. Examples in this document
+are in the form of [native Druid queries](querying.html). Refer to the [Druid SQL documentation](sql.html) for details
+about using multi-value string dimensions in SQL.
 
 ## Querying multi-value dimensions
 
@@ -109,9 +112,10 @@ This "selector" filter would match row4 of the dataset above:
 ### Grouping
 
 topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
-from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
-there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
-generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
+from matching rows will be used to generate one group per value. This can be thought of as the equivalent of the
+`UNNEST` operator applied to an `ARRAY` type, which many SQL dialects support. This means it's possible for a query to return
+more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match
+only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
 your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
 improve performance.
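Sketched in Python, the one-group-per-value behavior described above looks like this (illustrative only; the rows loosely mirror the example dataset in this document):

```python
from collections import Counter

rows = [
    {"tags": ["t1", "t2", "t3"]},  # row1
    {"tags": ["t3", "t4", "t5"]},  # row2
]

def group_counts(rows, dim):
    """Each value of a multi-value row contributes its own group, so a
    query can return more groups than there are rows."""
    counts = Counter()
    for row in rows:
        for value in row[dim]:
            counts[value] += 1
    return counts

# 2 rows produce 5 groups; "t3" appears in both rows.
print(group_counts(rows, "tags"))
```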
 
@@ -280,11 +284,15 @@ returns following result.
 ]
 ```
 
-You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is applied on the row before explosion. For multi-value dimensions, selector filter for "t3" would match row1 and row2, after which exploding is done. For multi-value dimensions, query filter matches a row if any individual value inside the multiple values matches the query filter.
+You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This happens because the query filter
+is applied to the row before explosion. For multi-value dimensions, the selector filter for "t3" matches row1 and row2,
+after which the explosion occurs. For multi-value dimensions, a query filter matches a row if any individual value among
+the multiple values matches the filter.
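The filter-before-explosion behavior can be sketched in Python (illustrative only; the rows loosely mirror the example dataset in this document):

```python
rows = [
    {"tags": ["t1", "t2", "t3"]},  # row1
    {"tags": ["t3", "t4", "t5"]},  # row2
    {"tags": ["t5", "t6", "t7"]},  # row3
]

def selector_filter_groups(rows, dim, value):
    """A selector filter matches a whole row if ANY of its values matches;
    only afterwards is each surviving row exploded into per-value groups."""
    matched = [row for row in rows if value in row[dim]]  # filter whole rows first
    groups = set()
    for row in matched:
        groups.update(row[dim])                           # then explode all values
    return groups

# Filtering on "t3" keeps row1 and row2, so their other values
# ("t1", "t2", "t4", "t5") show up in the result as well.
print(sorted(selector_filter_groups(rows, "tags", "t3")))
```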
 
 ### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes
 
-To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as in the query below.
+To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as
+in the query below.
 
 See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html#filtered-dimensionspecs) for details.
 
@@ -337,4 +345,6 @@ returns the following result.
 ]
 ```
 
-Note that, for groupBy queries, you could get similar result with a [having spec](having.html) but using a filtered dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline. Having specs are applied at the outermost level of groupBy query processing.
+Note that, for groupBy queries, you could get similar result with a [having spec](having.html) but using a filtered
+dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline.
+Having specs are applied at the outermost level of groupBy query processing.
diff --git a/docs/content/querying/sql.md b/docs/content/querying/sql.md
index 1f53c70..92b9e41 100644
--- a/docs/content/querying/sql.md
+++ b/docs/content/querying/sql.md
@@ -42,6 +42,61 @@ queries on the query Broker (the first process you query), which are then passed
 queries. Other than the (slight) overhead of translating SQL on the Broker, there isn't an additional performance
 penalty versus native queries.
 
+## Data types and casts
+
+Druid natively supports five basic column types: "long" (64 bit signed int), "float" (32 bit float), "double" (64 bit
+float), "string" (UTF-8 encoded strings and string arrays), and "complex" (catch-all for more exotic data types like
+hyperUnique and approxHistogram columns).
+
+Timestamps (including the `__time` column) are treated by Druid as longs, with the value being the number of
+milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. Therefore, timestamps in Druid do not carry any
+timezone information, but only carry information about the exact moment in time they represent. See the
+[Time functions](#time-functions) section for more information about timestamp handling.
+
+Druid generally treats NULLs and empty strings interchangeably, rather than according to the SQL standard. As such,
+Druid SQL only has partial support for NULLs. For example, the expressions `col IS NULL` and `col = ''` are equivalent,
+and both will evaluate to true if `col` contains an empty string. Similarly, the expression `COALESCE(col1, col2)` will
+return `col2` if `col1` is an empty string. While the `COUNT(*)` aggregator counts all rows, the `COUNT(expr)`
+aggregator will count the number of rows where expr is neither null nor the empty string. String columns in Druid are
+NULLable. Numeric columns are NOT NULL; if you query a numeric column that is not present in all segments of your Druid
+datasource, then it will be treated as zero for rows from those segments.
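The NULL/empty-string equivalence described above can be sketched in Python (illustrative only; this models Druid's default mode, not a Druid API):

```python
def is_druid_null(value):
    """In Druid's default mode, NULL and the empty string are interchangeable."""
    return value is None or value == ""

def coalesce(*values):
    """COALESCE: the first value that is neither NULL nor the empty string."""
    for value in values:
        if not is_druid_null(value):
            return value
    return None

def count_expr(values):
    """COUNT(expr): counts rows where expr is neither NULL nor the empty string."""
    return sum(1 for value in values if not is_druid_null(value))

print(coalesce("", "fallback"))          # fallback -- '' behaves like NULL
print(count_expr(["a", "", None, "b"]))  # 2, versus COUNT(*) which would be 4
```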
+
+For mathematical operations, Druid SQL will use integer math if all operands involved in an expression are integers.
+Otherwise, Druid will switch to floating point math. You can force this to happen by casting one of your operands
+to FLOAT. At runtime, Druid may widen 32-bit floats to 64-bit for certain operators, like SUM aggregators.
+
+Druid [multi-value string dimensions](multi-value-dimensions.html) appear in the table schema as `VARCHAR` typed, and
+may be used in expressions as such. Additionally, they can be treated as `ARRAY`-like via a handful of special
+multi-value operators. Expressions against multi-value string dimensions apply the expression to all values of the row,
+with the caveat that aggregations on these multi-value string columns observe the native Druid multi-value aggregation
+behavior, which is equivalent to the `UNNEST` operator available in many SQL dialects. Refer to the
+[multi-value string dimension documentation](multi-value-dimensions.html) and the
+[Druid expression documentation](../misc/math-expr.html) for additional details.
+
+The following table describes how SQL types map onto Druid types during query runtime. Casts between two SQL types
+that have the same Druid runtime type will have no effect, other than exceptions noted in the table. Casts between two
+SQL types that have different Druid runtime types will generate a runtime cast in Druid. If a value cannot be properly
+cast to another value, as in `CAST('foo' AS BIGINT)`, the runtime will substitute a default value. NULL values cast
+to non-nullable types will also be substituted with a default value (for example, nulls cast to numbers will be
+converted to zeroes).
+
+|SQL type|Druid runtime type|Default value|Notes|
+|--------|------------------|-------------|-----|
+|CHAR|STRING|`''`||
+|VARCHAR|STRING|`''`|Druid STRING columns are reported as VARCHAR|
+|DECIMAL|DOUBLE|`0.0`|DECIMAL uses floating point, not fixed point math|
+|FLOAT|FLOAT|`0.0`|Druid FLOAT columns are reported as FLOAT|
+|REAL|DOUBLE|`0.0`||
+|DOUBLE|DOUBLE|`0.0`|Druid DOUBLE columns are reported as DOUBLE|
+|BOOLEAN|LONG|`false`||
+|TINYINT|LONG|`0`||
+|SMALLINT|LONG|`0`||
+|INTEGER|LONG|`0`||
+|BIGINT|LONG|`0`|Druid LONG columns (except `__time`) are reported as BIGINT|
+|TIMESTAMP|LONG|`0`, meaning 1970-01-01 00:00:00 UTC|Druid's `__time` column is reported as TIMESTAMP. Casts between string and timestamp types assume standard SQL formatting, e.g. `2000-01-02 03:04:05`, _not_ ISO8601 formatting. For handling other formats, use one of the [time functions](#time-functions)|
+|DATE|LONG|`0`, meaning 1970-01-01|Casting TIMESTAMP to DATE rounds down the timestamp to the nearest day. Casts between string and date types assume standard SQL formatting, e.g. `2000-01-02`. For handling other formats, use one of the [time functions](#time-functions)|
+|OTHER|COMPLEX|none|May represent various Druid column types such as hyperUnique, approxHistogram, etc|
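The default-substitution rule for failed casts can be sketched in Python (illustrative only; this mirrors the BIGINT row of the table above and is not a Druid API):

```python
LONG_DEFAULT = 0  # default value for LONG-backed SQL types such as BIGINT

def cast_to_bigint(value):
    """A failed or NULL cast to a non-nullable numeric type substitutes
    the type's default value instead of raising an error."""
    if value is None:
        return LONG_DEFAULT
    try:
        return int(value)
    except (TypeError, ValueError):
        return LONG_DEFAULT

print(cast_to_bigint("42"))   # 42
print(cast_to_bigint("foo"))  # 0, like CAST('foo' AS BIGINT)
print(cast_to_bigint(None))   # 0, NULL cast to a non-nullable type
```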
+
 ## Query syntax
 
 Each Druid datasource appears as a table in the "druid" schema. This is also the default schema, so Druid datasources
@@ -269,6 +324,27 @@ simplest way to write literal timestamps in other time zones is to use TIME_PARS
 |`x OR y`|Boolean OR.|
 |`NOT x`|Boolean NOT.|
 
+### Multi-value string functions
+All 'array' references in the multi-value string function documentation can refer to multi-value string columns or
+`ARRAY` literals.
+
+|Function|Notes|
+|--------|-----|
+| `ARRAY(expr1,expr ...)` | constructs an SQL ARRAY literal from the expression arguments, using the type of the first argument as the output array type |
+| `MV_LENGTH(arr)` | returns length of array expression |
+| `MV_OFFSET(arr,long)` | returns the array element at the 0 based index supplied, or null for an out of range index|
+| `MV_ORDINAL(arr,long)` | returns the array element at the 1 based index supplied, or null for an out of range index |
+| `MV_CONTAINS(arr,expr)` | returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0 |
+| `MV_OVERLAP(arr1,arr2)` | returns 1 if arr1 and arr2 have any elements in common, else 0 |
+| `MV_OFFSET_OF(arr,expr)` | returns the 0 based index of the first occurrence of expr in the array, or `-1` (or `null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. |
+| `MV_ORDINAL_OF(arr,expr)` | returns the 1 based index of the first occurrence of expr in the array, or `-1` (or `null` if `druid.generic.useDefaultValueForNull=false`) if no matching elements exist in the array. |
+| `MV_PREPEND(expr,arr)` | adds expr to arr at the beginning, the resulting array type determined by the type of the array |
+| `MV_APPEND(arr1,expr)` | appends expr to arr, the resulting array type determined by the type of the first array |
+| `MV_CONCAT(arr1,arr2)` | concatenates 2 arrays, the resulting array type determined by the type of the first array |
+| `MV_SLICE(arr,start,end)` | return the subarray of arr from the 0 based index start(inclusive) to end(exclusive), or `null`, if start is less than 0, greater than length of arr, or greater than end|
+| `MV_TO_STRING(arr,str)` | joins all elements of arr by the delimiter specified by str |
+| `STRING_TO_MV(str1,str2)` | splits str1 into an array on the delimiter specified by str2 |
+
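The `MV_TO_STRING`/`STRING_TO_MV` pair round-trips a multi-value column through a delimited string, as sketched here (Python, illustrative only; not Druid APIs):

```python
def mv_to_string(arr, delimiter):
    """MV_TO_STRING(arr,str): join all elements with the delimiter."""
    return delimiter.join(arr)

def string_to_mv(s, delimiter):
    """STRING_TO_MV(str1,str2): split back into an array on the delimiter."""
    return s.split(delimiter)

tags = ["t1", "t2", "t3"]
joined = mv_to_string(tags, ",")          # 't1,t2,t3'
print(joined)
print(string_to_mv(joined, ","))          # round trip restores the array
```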
 ### Other functions
 
 |Function|Notes|
@@ -280,6 +356,7 @@ simplest way to write literal timestamps in other time zones is to use TIME_PARS
 |`COALESCE(value1, value2, ...)`|Returns the first value that is neither NULL nor empty string.|
 |`NVL(expr,expr-for-null)`|Returns 'expr-for-null' if 'expr' is null (or empty string for string type).|
 |`BLOOM_FILTER_TEST(<expr>, <serialized-filter>)`|Returns true if the value is contained in the base64 serialized bloom filter. See [bloom filter extension](../development/extensions-core/bloom-filter.html) documentation for additional details.|
+
 ### Unsupported features
 
 Druid does not support all SQL features, including:
@@ -291,57 +368,10 @@ Druid does not support all SQL features, including:
 
 Additionally, some Druid features are not supported by the SQL language. Some unsupported Druid features include:
 
-- [Multi-value dimensions](multi-value-dimensions.html).
-- [DataSketches aggregators](../development/extensions-core/datasketches-extension.html).
+- [Set operations on DataSketches aggregators](../development/extensions-core/datasketches-extension.html).
 - [Spatial filters](../development/geo.html).
 - [Query cancellation](querying.html#query-cancellation).
 
-## Data types and casts
-
-Druid natively supports five basic column types: "long" (64 bit signed int), "float" (32 bit float), "double" (64 bit
-float) "string" (UTF-8 encoded strings), and "complex" (catch-all for more exotic data types like hyperUnique and
-approxHistogram columns).
-
-Timestamps (including the `__time` column) are treated by Druid as longs, with the value being the number of
-milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. Therefore, timestamps in Druid do not carry any
-timezone information, but only carry information about the exact moment in time they represent. See the
-[Time functions](#time-functions) section for more information about timestamp handling.
-
-Druid generally treats NULLs and empty strings interchangeably, rather than according to the SQL standard. As such,
-Druid SQL only has partial support for NULLs. For example, the expressions `col IS NULL` and `col = ''` are equivalent,
-and both will evaluate to true if `col` contains an empty string. Similarly, the expression `COALESCE(col1, col2)` will
-return `col2` if `col1` is an empty string. While the `COUNT(*)` aggregator counts all rows, the `COUNT(expr)`
-aggregator will count the number of rows where expr is neither null nor the empty string. String columns in Druid are
-NULLable. Numeric columns are NOT NULL; if you query a numeric column that is not present in all segments of your Druid
-datasource, then it will be treated as zero for rows from those segments.
-
-For mathematical operations, Druid SQL will use integer math if all operands involved in an expression are integers.
-Otherwise, Druid will switch to floating point math. You can force this to happen by casting one of your operands
-to FLOAT. At runtime, Druid may widen 32-bit floats to 64-bit for certain operators, like SUM aggregators.
-
-The following table describes how SQL types map onto Druid types during query runtime. Casts between two SQL types
-that have the same Druid runtime type will have no effect, other than exceptions noted in the table. Casts between two
-SQL types that have different Druid runtime types will generate a runtime cast in Druid. If a value cannot be properly
-cast to another value, as in `CAST('foo' AS BIGINT)`, the runtime will substitute a default value. NULL values cast
-to non-nullable types will also be substitued with a default value (for example, nulls cast to numbers will be
-converted to zeroes).
-
-|SQL type|Druid runtime type|Default value|Notes|
-|--------|------------------|-------------|-----|
-|CHAR|STRING|`''`||
-|VARCHAR|STRING|`''`|Druid STRING columns are reported as VARCHAR|
-|DECIMAL|DOUBLE|`0.0`|DECIMAL uses floating point, not fixed point math|
-|FLOAT|FLOAT|`0.0`|Druid FLOAT columns are reported as FLOAT|
-|REAL|DOUBLE|`0.0`||
-|DOUBLE|DOUBLE|`0.0`|Druid DOUBLE columns are reported as DOUBLE|
-|BOOLEAN|LONG|`false`||
-|TINYINT|LONG|`0`||
-|SMALLINT|LONG|`0`||
-|INTEGER|LONG|`0`||
-|BIGINT|LONG|`0`|Druid LONG columns (except `__time`) are reported as BIGINT|
-|TIMESTAMP|LONG|`0`, meaning 1970-01-01 00:00:00 UTC|Druid's `__time` column is reported as TIMESTAMP. Casts between string and timestamp types assume standard SQL formatting, e.g. `2000-01-02 03:04:05`, _not_ ISO8601 formatting. For handling other formats, use one of the [time functions](#time-functions)|
-|DATE|LONG|`0`, meaning 1970-01-01|Casting TIMESTAMP to DATE rounds down the timestamp to the nearest day. Casts between string and date types assume standard SQL formatting, e.g. `2000-01-02`. For handling other formats, use one of the [time functions](#time-functions)|
-|OTHER|COMPLEX|none|May represent various Druid column types such as hyperUnique, approxHistogram, etc|
 
 ## Query execution
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org