You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/08/11 00:45:34 UTC
[GitHub] [druid] jihoonson opened a new pull request #11574: Configurable maxStreamLength for doubles sketches
jihoonson opened a new pull request #11574:
URL: https://github.com/apache/druid/pull/11574
### Description
A temporary config to work around the issue filed in https://github.com/apache/druid/issues/11544. For native queries, a new parameter, `maxStreamLength` is added for `DoublesSketchAggregatorFactory`. For SQL, a new query context, `approxQuantileDsMaxStreamLength` is added. Note that `approxQuantileDsMaxStreamLength` does not control `maxStreamLength` per individual `DoublesSketchAggregatorFactory`, but applies the same value for all aggregator factories. The rationale for this decision is as below.
- `approxQuantileDsMaxStreamLength` should not be used to optimize the memory usage of the sketch aggregator, but only used to work around #11544.
- Thus, I don't think users would ever set it to something smaller than default.
- As noted in https://github.com/apache/druid/issues/11544#issuecomment-895525040, the sketch grows quite slowly at this scale.
- As a result, it doesn't seem worth allowing users to set different `maxStreamLength` for each SQL function. I did it for native queries only because it was the least invasive.
<hr>
##### Key changed/added classes in this PR
* `DoublesSketches`
<hr>
<!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
This PR has:
- [x] been self-reviewed.
- [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [x] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis merged pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #11574:
URL: https://github.com/apache/druid/pull/11574
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695375561
##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketches.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.quantiles;
+
+import org.apache.druid.java.util.common.ISE;
+
+public final class DoublesSketches
+{
+ /**
+ * Runs the given task that updates a Doubles sketch backed by a direct byte buffer. This method intentionally
+ * accepts the update task as a {@link Runnable} instead of accpeting parameters of a sketch and other values
+ * needed for the update. This is to avoid any potential performance impact especially when the sketch is updated
Review comment:
I ran the benchmark I added in https://github.com/apache/druid/pull/11574/commits/7be41051242f16db225dfce48f86adc9b1966c3a. It doesn't seem to have any difference.
```
// 20: doubles sketches
query: "SELECT APPROX_QUANTILE_DS(sumFloatNormal, 0.5), DS_QUANTILES_SKETCH(maxLongUniform) FROM foo"
master
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 20 5000000 false avgt 15 341.659 ± 0.932 ms/op
SqlBenchmark.querySql 20 5000000 force avgt 15 446.572 ± 3.279 ms/op
PR
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 20 5000000 false avgt 15 341.956 ± 2.478 ms/op
SqlBenchmark.querySql 20 5000000 force avgt 15 442.048 ± 1.420 ms/op
```
Interestingly, the query was _slower_ when the vectorization was on. AFAIT, the difference is that the non-vectorized version uses `HeapUpdateDoublesSketch` while the vectorized one uses `DirectUpdateDoublesSketch`. Attaching flame graphs.
Non-vectorized
![Screenshot from 2021-08-24 21-02-35](https://user-images.githubusercontent.com/2322288/130724807-18b430fb-ca4c-4644-aae3-20612b0cd0cf.png)
Vectorized
![Screenshot from 2021-08-24 21-02-57](https://user-images.githubusercontent.com/2322288/130724816-74bdabe7-b7cf-439f-855a-2dd7dbd6460e.png)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] techdocsmith commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695936382
##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|
Review comment:
```suggestion
|maxStreamLength|The maximum number of items to store in each sketch. If one of the sketches hits this limit, the query can throw an `IllegalStateException`. See the [GitHub issue 11544](https://github.com/apache/druid/issues/11544) for more details. To workaround this issue, increase the max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. This parameter is added as a temporary solution to avoid the known issue. It may be removed in a future release after the bug is fixed.|no, defaults to 1,000,000,000|
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] techdocsmith commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695939705
##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|This parameter is added as a temporary solution to avoid a [known issue](https://github.com/apache/druid/issues/11544), and can be removed in a future release when the bug is fixed. With this parameter, you can control the max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch reaches this limit. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.|no, defaults to 1000000000|
Review comment:
```suggestion
|maxStreamLength|This parameter is a temporary solution to avoid a [known issue](https://github.com/apache/druid/issues/11544). It may be removed in a future release after the bug is fixed. This parameter defines the maximum number of items to store in each sketch. If a sketch reaches the limit, the query can throw `IllegalStateException`. To workaround this issue, increase the maximum stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.|no, defaults to 1000000000|
```
##########
File path: docs/querying/sql.md
##########
@@ -339,9 +339,9 @@ Only the COUNT, ARRAY_AGG, and STRING_AGG aggregations can accept the DISTINCT k
|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
|`APPROX_QUANTILE(expr, probability, [resolution])`|_Deprecated._ Use `APPROX_QUANTILE_DS` instead, which provides a superior distribution-independent algorithm with formal error guarantees.<br/><br/>Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.|`NaN`|
-|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`NaN`|
+|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.<br/><br/>See the [below section](#a-known-issue-with-approximate-functions-based-on-data-sketches) for a known issue with using this function.|`NaN`|
Review comment:
```suggestion
|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. Allowable "probability" values are between 0 and 1, exclusive. The `k` parameter is described in the Quantiles sketch documentation. You must load [DataSketches extension](../development/extensions-core/datasketches-extension.md) to use this function.<br/><br/>See the [known issue](#a-known-issue-with-approximate-functions-based-on-data-sketches) with this function.|`NaN`|
```
##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
- Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
approximate, regardless of configuration.
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.
Review comment:
```suggestion
the query hits `maxStreamLength`: the maximum number of items to store in each sketch.
```
##########
File path: docs/querying/sql.md
##########
@@ -339,9 +339,9 @@ Only the COUNT, ARRAY_AGG, and STRING_AGG aggregations can accept the DISTINCT k
|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
|`APPROX_QUANTILE(expr, probability, [resolution])`|_Deprecated._ Use `APPROX_QUANTILE_DS` instead, which provides a superior distribution-independent algorithm with formal error guarantees.<br/><br/>Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.|`NaN`|
-|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`NaN`|
+|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.<br/><br/>See the [below section](#a-known-issue-with-approximate-functions-based-on-data-sketches) for a known issue with using this function.|`NaN`|
|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram) exprs. The "probability" should be between 0 and 1 (exclusive). The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.|`0.0`|
-|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
+|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.<br/><br/>See the [below section](#a-known-issue-with-approximate-functions-based-on-data-sketches) for a known issue with using this function.|`'0'` (STRING)|
Review comment:
```suggestion
|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) on the values of `expr`, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. You must load the [DataSketches extension](../development/extensions-core/datasketches-extension.md) to use this function.<br/><br/>See the [known issue](#a-known-issue-with-approximate-functions-based-on-data-sketches) with this function.|`'0'` (STRING)|
```
##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
- Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
approximate, regardless of configuration.
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.
+See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug.
+A known workaround is raising the max stream length to something higher by setting `approxQuantileDsMaxStreamLength`
+in the query context. Since it is set to 1,000,000,000 by default, you don't need to override it in most cases.
+See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.
+Since this query context is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.
Review comment:
```suggestion
This query context parameter is a temporary solution to avoid the known issue. It may be removed in a future release after the bug is fixed.
```
##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
- Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
approximate, regardless of configuration.
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
Review comment:
```suggestion
The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with an `IllegalStateException` if one of the sketches for
```
##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
- Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
approximate, regardless of configuration.
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.
+See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug.
Review comment:
```suggestion
See [GitHub issue 11544](https://github.com/apache/druid/issues/11544) for more details.
```
##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
- Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
approximate, regardless of configuration.
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.
+See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug.
+A known workaround is raising the max stream length to something higher by setting `approxQuantileDsMaxStreamLength`
Review comment:
```suggestion
To workaround the issue, increase value of the maximum string length with the `approxQuantileDsMaxStreamLength` parameter
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] techdocsmith commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695936382
##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|
Review comment:
```suggestion
|maxStreamLength|The maximum number of items to store in each sketch. If one of the sketches hits this limit, the query can throw an `IllegalStateException`. See the [GitHub issue 11544](https://github.com/apache/druid/issues/11544) for more details. To workaround this issue, increase the max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. This parameter is a temporary solution to avoid the known issue. It may be removed in a future release after the bug is fixed.|no, defaults to 1,000,000,000|
```
Some grammar / style changes. Otherwise LGTM
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695418946
##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketches.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.quantiles;
+
+import org.apache.druid.java.util.common.ISE;
+
+public final class DoublesSketches
+{
+ /**
+ * Runs the given task that updates a Doubles sketch backed by a direct byte buffer. This method intentionally
+ * accepts the update task as a {@link Runnable} instead of accpeting parameters of a sketch and other values
+ * needed for the update. This is to avoid any potential performance impact especially when the sketch is updated
Review comment:
I tested a couple of more queries that produce more than one row. Looking at these results, the change in this PR doesn't seem to introduce huge overhead. However, the performance of these queries does seem interesting when vectorization is on. For query 2, it used `DirectUpdateDoublesSketch` for both when vectorization was on and off, so the result seems reasonable as the query was faster with vectorization. However, for query 3, even though both used `DirectUpdateDoublesSketch`, the query was slower with vectorization. This seems because `MemoryOpenHashTable.copyTo()` was more expensive than `ByteBufferHashTable.adjustTableWhenFull()`. I'm not 100% sure about this yet, so will dig further.
```
query 2:
SELECT dimZipf, APPROX_QUANTILE_DS(sumFloatNormal, 0.5), DS_QUANTILES_SKETCH(maxLongUniform)
FROM foo
GROUP BY 1
master
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 20 5000000 false avgt 15 617.615 ± 4.681 ms/op
SqlBenchmark.querySql 20 5000000 force avgt 15 512.804 ± 3.785 ms/op
PR
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 20 5000000 false avgt 15 625.830 ± 8.074 ms/op
SqlBenchmark.querySql 20 5000000 force avgt 15 533.246 ± 3.985 ms/op
```
```
query 3:
SELECT dimSequential, dimZipf, APPROX_QUANTILE_DS(sumFloatNormal, 0.5)
FROM foo
GROUP BY 1, 2
* To avoid ResourceLimitExceededException, the buffer size and rowsPerSegment were set to 2GB and 100000, respectively.
master
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 20 100000 false avgt 15 391.651 ± 1.849 ms/op
SqlBenchmark.querySql 20 100000 force avgt 15 671.631 ± 1.640 ms/op
PR
Benchmark (query) (rowsPerSegment) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 20 100000 false avgt 15 395.111 ± 1.488 ms/op
SqlBenchmark.querySql 20 100000 force avgt 15 672.921 ± 2.680 ms/op
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] didip commented on pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
didip commented on pull request #11574:
URL: https://github.com/apache/druid/pull/11574#issuecomment-965812184
@jihoonson is this available on Druid 0.22.0?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695375994
##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|
Review comment:
I updated the description as suggested. @techdocsmith let me know if you have a better idea.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis merged pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #11574:
URL: https://github.com/apache/druid/pull/11574
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695375994
##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|
Review comment:
I updated the description as suggested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] clintropolis commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches
Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r694375452
##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketches.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.quantiles;
+
+import org.apache.druid.java.util.common.ISE;
+
+public final class DoublesSketches
+{
+ /**
+ * Runs the given task that updates a Doubles sketch backed by a direct byte buffer. This method intentionally
+ * accepts the update task as a {@link Runnable} instead of accpeting parameters of a sketch and other values
+ * needed for the update. This is to avoid any potential performance impact especially when the sketch is updated
Review comment:
did you measure if this approach has any perf impact?
##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|
Review comment:
It might be best to start off with the info that this parameter is a temporary workaround and link to the issue, but I'm not really sure what is best. @techdocsmith any suggestions?
maybe ".. can throw `IllegalStateException` when any one sketch reaches this limit. ...`
also, should we remove the ',' from the number or is that ok to put in the json?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org