You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by su...@apache.org on 2021/07/26 19:29:19 UTC

[druid] branch master updated: Docs - HLL lgK tip and slight layout change (#11482)

This is an automated email from the ASF dual-hosted git repository.

suneet pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git


The following commit(s) were added to refs/heads/master by this push:
     new 973e5bf  Docs - HLL lgK tip and slight layout change (#11482)
973e5bf is described below

commit 973e5bf7d06c6cb021c242360b2035deb571541d
Author: Peter Marshall <42...@users.noreply.github.com>
AuthorDate: Mon Jul 26 20:28:53 2021 +0100

    Docs - HLL lgK tip and slight layout change (#11482)
    
    * HLL lgK and a tip
    
    Knowledge transfer from https://the-asf.slack.com/archives/CJ8D1JTB8/p1600699967024200.  Attempted to make a connection between the SQL HLL function and the HLL underneath without getting too complicated.  Also added a note about using K over 16 being pretty much pointless.
    
    * Corrected spelling
    
    * Create datasketches-hll.md
    
    Put roll-up back to rollup
    
    * Update docs/development/extensions-core/datasketches-hll.md
    
    Co-authored-by: Abhishek Agarwal <14...@users.noreply.github.com>
    
    Co-authored-by: Abhishek Agarwal <14...@users.noreply.github.com>
---
 .../extensions-core/datasketches-hll.md            | 42 +++++++++++++++++-----
 docs/querying/sql.md                               |  4 +--
 2 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/docs/development/extensions-core/datasketches-hll.md b/docs/development/extensions-core/datasketches-hll.md
index cc39e7e..c359dc7 100644
--- a/docs/development/extensions-core/datasketches-hll.md
+++ b/docs/development/extensions-core/datasketches-hll.md
@@ -34,6 +34,20 @@ druid.extensions.loadList=["druid-datasketches"]
 
 ### Aggregators
 
+|property|description|required?|
+|--------|-----------|---------|
+|`type`|This String should be [`HLLSketchBuild`](#hllsketchbuild-aggregator) or [`HLLSketchMerge`](#hllsketchmerge-aggregator)|yes|
+|`name`|A String for the output (result) name of the calculation.|yes|
+|`fieldName`|A String for the name of the input field.|yes|
+|`lgK`|log2 of K that is the number of buckets in the sketch, parameter that controls the size and the accuracy. Must be a power of 2 from 4 to 21 inclusively.|no, defaults to `12`|
+|`tgtHllType`|The type of the target HLL sketch. Must be `HLL_4`, `HLL_6` or `HLL_8` |no, defaults to `HLL_4`|
+|`round`|Round off values to whole numbers. Only affects query-time behavior and is ignored at ingestion-time.|no, defaults to `false`|
+
+
+> The default `lgK` value has proven to be sufficient for most use cases; expect only very negligible improvements in accuracy with `lgK` values over `16` in normal circumstances.
+
+#### HLLSketchBuild Aggregator
+
 ```
 {
   "type" : "HLLSketchBuild",
@@ -45,6 +59,25 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+> It is very common to use `HLLSketchBuild` in combination with [rollup](../../ingestion/index.html#rollup) to create a [metric](../../ingestion/index.html#metricsspec) on high-cardinality columns.  In this example, a metric called `userid_hll` is included in the `metricsSpec`.  This will perform a HLL sketch on the `userid` field at ingestion time, allowing for highly-performant approximate `COUNT DISTINCT` query operations and improving roll-up ratios when `userid` is then left out of  [...]
+>
+> ```
+> :
+> "metricsSpec": [
+>  {
+>    "type" : "HLLSketchBuild",
+>    "name" : "userid_hll",
+>    "fieldName" : "userid",
+>    "lgK" : 12,
+>    "tgtHllType" : "HLL_4"
+>  }
+> ]
+> :
+> ```
+>
+
+#### HLLSketchMerge Aggregator
+
 ```
 {
   "type" : "HLLSketchMerge",
@@ -56,15 +89,6 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
-|property|description|required?|
-|--------|-----------|---------|
-|type|This String should be "HLLSketchBuild" or "HLLSketchMerge"|yes|
-|name|A String for the output (result) name of the calculation.|yes|
-|fieldName|A String for the name of the input field.|yes|
-|lgK|log2 of K that is the number of buckets in the sketch, parameter that controls the size and the accuracy. Must be a power of 2 from 4 to 21 inclusively.|no, defaults to 12|
-|tgtHllType|The type of the target HLL sketch. Must be "HLL&lowbar;4", "HLL&lowbar;6" or "HLL&lowbar;8" |no, defaults to "HLL&lowbar;4"|
-|round|Round off values to whole numbers. Only affects query-time behavior and is ignored at ingestion-time.|no, defaults to false|
-
 ### Post Aggregators
 
 #### Estimate
diff --git a/docs/querying/sql.md b/docs/querying/sql.md
index fd5903d..00e801d 100644
--- a/docs/querying/sql.md
+++ b/docs/querying/sql.md
@@ -334,8 +334,8 @@ Only the COUNT and ARRAY_AGG aggregations can accept the DISTINCT keyword.
 |`MAX(expr)`|Takes the maximum of numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `-9223372036854775808` (minimum LONG value)|
 |`AVG(expr)`|Averages numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a regular column or a hyperUnique column. This is always approximate, regardless of the value of "useApproximateCountDistinct". This uses Druid's built-in "cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.|`0`|
-|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of expr, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) mus [...]
-|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.md) column. The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use thi [...]
+|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of `expr`, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.md) column. Results are always approximate, regardless of the value of [`useApproximateCountDistinct`](../querying/sql.html#connection-context). The `lgK` and `tgtHllType` parameters here are, like the equivalents in the [aggregator](../development/extensions-core/datasketches-hll.html#aggregators), des [...]
+|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.md) column. This is always approximate, regardless of the value of [`useApproximateCountDistinct`](../querying/sql.html#connection-context).  The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded [...]
 |`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
 |`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
 |`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histo [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org