You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/08/11 00:45:34 UTC

[GitHub] [druid] jihoonson opened a new pull request #11574: Configurable maxStreamLength for doubles sketches

jihoonson opened a new pull request #11574:
URL: https://github.com/apache/druid/pull/11574


   ### Description
   
   A temporary config to work around the issue filed in https://github.com/apache/druid/issues/11544. For native queries, a new parameter, `maxStreamLength` is added for `DoublesSketchAggregatorFactory`. For SQL, a new query context, `approxQuantileDsMaxStreamLength` is added. Note that `approxQuantileDsMaxStreamLength` does not control `maxStreamLength` per individual `DoublesSketchAggregatorFactory`, but applies the same value for all aggregator factories. The rationale for this decision is as below.
   
   - `approxQuantileDsMaxStreamLength` should not be used to optimize the memory usage of the sketch aggregator, but only used to work around #11544. 
   - Thus, I don't think users would ever set it to something smaller than default.
   - As noted in https://github.com/apache/druid/issues/11544#issuecomment-895525040, the sketch grows quite slowly at this scale.
   - As a result, it doesn't seem worth allowing users to set different `maxStreamLength` for each SQL function. I did it for native queries only because it was the least invasive.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `DoublesSketches`
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [x] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #11574:
URL: https://github.com/apache/druid/pull/11574


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695375561



##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketches.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.quantiles;
+
+import org.apache.druid.java.util.common.ISE;
+
+public final class DoublesSketches
+{
+  /**
+   * Runs the given task that updates a Doubles sketch backed by a direct byte buffer. This method intentionally
+   * accepts the update task as a {@link Runnable} instead of accpeting parameters of a sketch and other values
+   * needed for the update. This is to avoid any potential performance impact especially when the sketch is updated

Review comment:
       I ran the benchmark I added in https://github.com/apache/druid/pull/11574/commits/7be41051242f16db225dfce48f86adc9b1966c3a. It doesn't seem to have any difference.
   
   ```
   // 20: doubles sketches
   query: "SELECT APPROX_QUANTILE_DS(sumFloatNormal, 0.5), DS_QUANTILES_SKETCH(maxLongUniform) FROM foo"
   
   master
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
   SqlBenchmark.querySql       20           5000000        false  avgt   15  341.659 ± 0.932  ms/op
   SqlBenchmark.querySql       20           5000000        force  avgt   15  446.572 ± 3.279  ms/op
   
   PR
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
   SqlBenchmark.querySql       20           5000000        false  avgt   15  341.956 ± 2.478  ms/op
   SqlBenchmark.querySql       20           5000000        force  avgt   15  442.048 ± 1.420  ms/op
   ```
   
   Interestingly, the query was _slower_ when the vectorization was on. AFAIT, the difference is that the non-vectorized version uses `HeapUpdateDoublesSketch` while the vectorized one uses `DirectUpdateDoublesSketch`. Attaching flame graphs.
   
   Non-vectorized
   ![Screenshot from 2021-08-24 21-02-35](https://user-images.githubusercontent.com/2322288/130724807-18b430fb-ca4c-4644-aae3-20612b0cd0cf.png)
   
   Vectorized
   ![Screenshot from 2021-08-24 21-02-57](https://user-images.githubusercontent.com/2322288/130724816-74bdabe7-b7cf-439f-855a-2dd7dbd6460e.png)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695936382



##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
 |name|A String for the output (result) name of the calculation.|yes|
 |fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
 |k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|

Review comment:
       ```suggestion
   |maxStreamLength|The maximum number of items to store in each sketch. If one of the sketches hits this limit, the query can throw an `IllegalStateException`. See the [GitHub issue 11544](https://github.com/apache/druid/issues/11544) for more details. To workaround this issue, increase the max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. This parameter is added as a temporary solution to avoid the known issue. It may be removed in a future release after the bug is fixed.|no, defaults to 1,000,000,000|
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695939705



##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
 |name|A String for the output (result) name of the calculation.|yes|
 |fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
 |k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|This parameter is added as a temporary solution to avoid a [known issue](https://github.com/apache/druid/issues/11544), and can be removed in a future release when the bug is fixed. With this parameter, you can control the max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch reaches this limit. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.|no, defaults to 1000000000|

Review comment:
       ```suggestion
   |maxStreamLength|This parameter is a temporary solution to avoid a [known issue](https://github.com/apache/druid/issues/11544). It may be removed in a future release after the bug is fixed. This parameter defines the maximum number of items to store in each sketch. If a sketch reaches the limit, the query can throw `IllegalStateException`. To workaround this issue, increase the maximum stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.|no, defaults to 1000000000|
   ```

##########
File path: docs/querying/sql.md
##########
@@ -339,9 +339,9 @@ Only the COUNT, ARRAY_AGG, and STRING_AGG aggregations can accept the DISTINCT k
 |`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
 |`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
 |`APPROX_QUANTILE(expr, probability, [resolution])`|_Deprecated._ Use `APPROX_QUANTILE_DS` instead, which provides a superior distribution-independent algorithm with formal error guarantees.<br/><br/>Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.|`NaN`|
-|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`NaN`|
+|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.<br/><br/>See the [below section](#a-known-issue-with-approximate-functions-based-on-data-sketches) for a known issue with using this function.|`NaN`|

Review comment:
       ```suggestion
   |`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. Allowable "probability" values are between 0 and 1, exclusive. The `k` parameter is described in the Quantiles sketch documentation. You must load [DataSketches extension](../development/extensions-core/datasketches-extension.md) to use this function.<br/><br/>See the [known issue](#a-known-issue-with-approximate-functions-based-on-data-sketches) with this function.|`NaN`|
   ```

##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
 - Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
 approximate, regardless of configuration.
 
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.

Review comment:
       ```suggestion
   the query hits `maxStreamLength`: the maximum number of items to store in each sketch.
   ```

##########
File path: docs/querying/sql.md
##########
@@ -339,9 +339,9 @@ Only the COUNT, ARRAY_AGG, and STRING_AGG aggregations can accept the DISTINCT k
 |`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
 |`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
 |`APPROX_QUANTILE(expr, probability, [resolution])`|_Deprecated._ Use `APPROX_QUANTILE_DS` instead, which provides a superior distribution-independent algorithm with formal error guarantees.<br/><br/>Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.|`NaN`|
-|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`NaN`|
+|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.<br/><br/>See the [below section](#a-known-issue-with-approximate-functions-based-on-data-sketches) for a known issue with using this function.|`NaN`|
 |`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram) exprs. The "probability" should be between 0 and 1 (exclusive). The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.|`0.0`|
-|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
+|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.<br/><br/>See the [below section](#a-known-issue-with-approximate-functions-based-on-data-sketches) for a known issue with using this function.|`'0'` (STRING)|

Review comment:
       ```suggestion
   |`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) on the values of `expr`, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. You must load the [DataSketches extension](../development/extensions-core/datasketches-extension.md) to use this function.<br/><br/>See the [known issue](#a-known-issue-with-approximate-functions-based-on-data-sketches) with this function.|`'0'` (STRING)|
   ```

##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
 - Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
 approximate, regardless of configuration.
 
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.
+See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug.
+A known workaround is raising the max stream length to something higher by setting `approxQuantileDsMaxStreamLength`
+in the query context. Since it is set to 1,000,000,000 by default, you don't need to override it in most cases.
+See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.
+Since this query context is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.

Review comment:
       ```suggestion
   This query context  parameter is a temporary solution to avoid the known issue. It may be removed in a future release after the bug is fixed.
   ```

##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
 - Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
 approximate, regardless of configuration.
 
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created

Review comment:
       ```suggestion
   The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with an `IllegalStateException` if one of the sketches for
   ```

##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
 - Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
 approximate, regardless of configuration.
 
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.
+See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug.

Review comment:
       ```suggestion
   See [GitHub issue 11544](https://github.com/apache/druid/issues/11544) for more details.
   ```

##########
File path: docs/querying/sql.md
##########
@@ -820,6 +820,16 @@ either through query context or through Broker configuration.
 - Aggregation functions that are labeled as using sketches or approximations, such as APPROX_COUNT_DISTINCT, are always
 approximate, regardless of configuration.
 
+#### A known issue with approximate functions based on data sketches
+
+The `APPROX_QUANTILE_DS` and `DS_QUANTILES_SKETCH` functions can fail with `IllegalStateException` if one of the sketches created
+during the query hits `maxStreamLength`, which is the max number of items that can be stored in each sketch.
+See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug.
+A known workaround is raising the max stream length to something higher by setting `approxQuantileDsMaxStreamLength`

Review comment:
       ```suggestion
   To workaround the issue, increase value of the maximum string length with the `approxQuantileDsMaxStreamLength` parameter
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695936382



##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
 |name|A String for the output (result) name of the calculation.|yes|
 |fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
 |k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|

Review comment:
       ```suggestion
   |maxStreamLength|The maximum number of items to store in each sketch. If one of the sketches hits this limit, the query can throw an `IllegalStateException`. See the [GitHub issue 11544](https://github.com/apache/druid/issues/11544) for more details. To workaround this issue, increase the max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. This parameter is a temporary solution to avoid the known issue. It may be removed in a future release after the bug is fixed.|no, defaults to 1,000,000,000|
   ```
   Some grammar / style changes. Otherwise LGTM




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695418946



##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketches.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.quantiles;
+
+import org.apache.druid.java.util.common.ISE;
+
+public final class DoublesSketches
+{
+  /**
+   * Runs the given task that updates a Doubles sketch backed by a direct byte buffer. This method intentionally
+   * accepts the update task as a {@link Runnable} instead of accpeting parameters of a sketch and other values
+   * needed for the update. This is to avoid any potential performance impact especially when the sketch is updated

Review comment:
       I tested a couple of more queries that produce more than one row. Looking at these results, the change in this PR doesn't seem to introduce huge overhead. However, the performance of these queries does seem interesting when vectorization is on. For query 2, it used `DirectUpdateDoublesSketch` for both when vectorization was on and off, so the result seems reasonable as the query was faster with vectorization. However, for query 3, even though both used `DirectUpdateDoublesSketch`, the query was slower with vectorization. This seems because `MemoryOpenHashTable.copyTo()` was more expensive than `ByteBufferHashTable.adjustTableWhenFull()`. I'm not 100% sure about this yet, so will dig further. 
   
   ```
   query 2: 
   SELECT dimZipf, APPROX_QUANTILE_DS(sumFloatNormal, 0.5), DS_QUANTILES_SKETCH(maxLongUniform)
   FROM foo
   GROUP BY 1
   
   master
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
   SqlBenchmark.querySql       20           5000000        false  avgt   15  617.615 ± 4.681  ms/op
   SqlBenchmark.querySql       20           5000000        force  avgt   15  512.804 ± 3.785  ms/op
   
   PR
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
   SqlBenchmark.querySql       20           5000000        false  avgt   15  625.830 ± 8.074  ms/op
   SqlBenchmark.querySql       20           5000000        force  avgt   15  533.246 ± 3.985  ms/op
   ```
   
   ```
   query 3: 
   SELECT dimSequential, dimZipf, APPROX_QUANTILE_DS(sumFloatNormal, 0.5)
   FROM foo
   GROUP BY 1, 2
   
   * To avoid ResourceLimitExceededException, the buffer size and rowsPerSegment were set to 2GB and 100000, respectively. 
   
   master
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
   SqlBenchmark.querySql       20            100000        false  avgt   15  391.651 ± 1.849  ms/op
   SqlBenchmark.querySql       20            100000        force  avgt   15  671.631 ± 1.640  ms/op
   
   PR
   
   Benchmark              (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
   SqlBenchmark.querySql       20            100000        false  avgt   15  395.111 ± 1.488  ms/op
   SqlBenchmark.querySql       20            100000        force  avgt   15  672.921 ± 2.680  ms/op
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
didip commented on pull request #11574:
URL: https://github.com/apache/druid/pull/11574#issuecomment-965812184


   @jihoonson is this available on Druid 0.22.0?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695375994



##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
 |name|A String for the output (result) name of the calculation.|yes|
 |fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
 |k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|

Review comment:
       I updated the description as suggested. @techdocsmith let me know if you have a better idea.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #11574:
URL: https://github.com/apache/druid/pull/11574


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r695375994



##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
 |name|A String for the output (result) name of the calculation.|yes|
 |fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
 |k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|

Review comment:
       I updated the description as suggested.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #11574: Configurable maxStreamLength for doubles sketches

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #11574:
URL: https://github.com/apache/druid/pull/11574#discussion_r694375452



##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketches.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.quantiles;
+
+import org.apache.druid.java.util.common.ISE;
+
+public final class DoublesSketches
+{
+  /**
+   * Runs the given task that updates a Doubles sketch backed by a direct byte buffer. This method intentionally
+   * accepts the update task as a {@link Runnable} instead of accpeting parameters of a sketch and other values
+   * needed for the update. This is to avoid any potential performance impact especially when the sketch is updated

Review comment:
       did you measure if this approach has any perf impact?

##########
File path: docs/development/extensions-core/datasketches-quantiles.md
##########
@@ -56,6 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
 |name|A String for the output (result) name of the calculation.|yes|
 |fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
 |k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
+|maxStreamLength|The max number of items that can be stored in each sketch. Currently, the query can throw `IllegalStateException` when one of the sketch hits this limit. See the [GitHub issue](https://github.com/apache/druid/issues/11544) for more details of the bug. A known workaround is using a higher max stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length. Since this parameter is added as a temporary solution to avoid the known issue, it can be removed in a future release when the bug is fixed.|no, defaults to 1,000,000,000|

Review comment:
       It might be best to start off with the info that this parameter is a temporary workaround and link to the issue, but I'm not really sure what is best. @techdocsmith any suggestions?
   
   maybe ".. can throw `IllegalStateException` when any one sketch reaches this limit. ...`
   
   also, should we remove the ',' from the number or is that ok to put in the json?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org