You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/14 22:24:05 UTC

[GitHub] [druid] cloventt opened a new pull request, #13088: Add a note to the documentation about pre-built HLLSketches

cloventt opened a new pull request, #13088:
URL: https://github.com/apache/druid/pull/13088

   <!-- Thanks for trying to help us make Apache Druid be the best it can be! Please fill out as much of the following information as is possible (where relevant, and remove it when irrelevant) to help make the intention and scope of this PR clear in order to ease review. -->
   
   <!-- Please read the doc for contribution (https://github.com/apache/druid/blob/master/CONTRIBUTING.md) before making this PR. Also, once you open a PR, please _avoid using force pushes and rebasing_ since these make it difficult for reviewers to see what you've changed in response to their reviews. See [the 'If your pull request shows conflicts with master' section](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#if-your-pull-request-shows-conflicts-with-master) for more details. -->
   
   <!-- Replace XXXX with the id of the issue fixed in this PR. Remove this section if there is no corresponding issue. Don't reference the issue in the title of this pull-request. -->
   
   <!-- If you are a committer, follow the PR action item checklist for committers:
   https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers. -->
   
   ### Description
   
   Druid actually supports ingesting a pre-generated sketch column by using the `HLLSketchMerge` aggregator. However, this functionality was previously not made clear in the documentation.
   
   <!-- Describe the goal of this PR, what problem are you fixing. If there is a corresponding issue (referenced above), it's not necessary to repeat the description here, however, you may choose to keep one summary sentence. -->
   
   <!-- Describe your patch: what did you change in code? How did you fix the problem? -->
   
   <!-- If there are several relatively logically separate changes in this PR, create a mini-section for each of them. For example: -->
   
   <!--
   In each section, please describe design decisions made, including:
    - Choice of algorithms
    - Behavioral aspects. What configuration values are acceptable? How are corner cases and error conditions handled, such as when there are insufficient resources?
    - Class organization and design (how the logic is split between classes, inheritance, composition, design patterns)
    - Method organization and design (how the logic is split between methods, parameters and return types)
    - Naming (class, method, API, configuration, HTTP endpoint, names of emitted metrics)
   -->
   
   
   <!-- It's good to describe an alternative design (or mention an alternative name) for every design (or naming) decision point and compare the alternatives with the designs that you've implemented (or the names you've chosen) to highlight the advantages of the chosen designs and names. -->
   
   <!-- If there was a discussion of the design of the feature implemented in this PR elsewhere (e. g. a "Proposal" issue, any other issue, or a thread in the development mailing list), link to that discussion from this PR description and explain what have changed in your final design compared to your original proposal or the consensus version in the end of the discussion. If something hasn't changed since the original discussion, you can omit a detailed discussion of those aspects of the design here, perhaps apart from brief mentioning for the sake of readability of this PR description. -->
   
   <!-- Some of the aspects mentioned above may be omitted for simple and small changes. -->
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * None
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r972529744


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -59,6 +59,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchBuild` aggregator builds a datasketch from the input column specified. If used during ingestion, this
+will result in Druid storing pre-generated HLL Sketch objects in the datasource, rather than the original value itself.
+If used at query time on an existing dimension, the resulting column can be used as an intermediate dimension by the
+post-aggregators below.
+

Review Comment:
   ```suggestion
   The `HLLSketchBuild` aggregator builds an HLL sketch object from the specified input column. When used during ingestion, Druid stores pre-generated HLL sketch objects in the datasource instead of the raw data from the input column.
   When applied at query time on an existing dimension, you can use the resulting column as an intermediate dimension by the [post-aggregators](#post-aggregators).
   
   ```
   Thanks for the clarification/ contribution @cloventt ! I've suggest some stylistic changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vtlim commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
vtlim commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r973368695


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -59,6 +59,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchBuild` aggregator builds a datasketch from the input column specified. If used during ingestion, this
+will result in Druid storing pre-generated HLL Sketch objects in the datasource, rather than the original value itself.
+If used at query time on an existing dimension, the resulting column can be used as an intermediate dimension by the
+post-aggregators below.
+

Review Comment:
   I made a few minor edits to the suggestion above for consistency of language and style. One question: What does "original values" refer to? The raw data that made up the sketch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r972529744


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -59,6 +59,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchBuild` aggregator builds a datasketch from the input column specified. If used during ingestion, this
+will result in Druid storing pre-generated HLL Sketch objects in the datasource, rather than the original value itself.
+If used at query time on an existing dimension, the resulting column can be used as an intermediate dimension by the
+post-aggregators below.
+

Review Comment:
   ```suggestion
   The `HLLSketchBuild` aggregator builds a datasketch from the specified input column. When used during ingestion, Druid stores pre-generated HLL Sketch objects in the datasource instead of the original values.
   When applied at query time on an existing dimension, you can use the resulting column as an intermediate dimension by the [post-aggregators](#post-aggregators).
   
   ```
   Thanks for the clarification/ contribution @cloventt ! I've suggest some stylistic changes.



##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -89,6 +94,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchMerge` aggregator can be used to ingest pre-generated sketches from an input dataset. For example, an
+earlier batch processing job can be used to generate the sketches before the data is sent to Druid. To support this
+behaviour, the sketches in the input dataset must be serialised to base64-encoded bytes. Then, in the native ingestion
+`MetricsSpec` the `HLLSketchMerge` must be specified for the input column as shown above.
+

Review Comment:
   ```suggestion
   You can use the `HLLSketchMerge` aggregator to ingest pre-generated sketches from an input dataset. For example, you can set up a batch processing job to generate the sketches before sending the data to Druid. You must serialize the sketches in the input dataset to base-64 encoded bytes. Then, specify `HLLSketchMerge` for the input column in the native ingestion`MetricsSpec`.
   
   ```
   Stylistic suggestions. Also wonder if it might be helpful to have an example of the `MetricsSpec`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r972530970


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -89,6 +94,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchMerge` aggregator can be used to ingest pre-generated sketches from an input dataset. For example, an
+earlier batch processing job can be used to generate the sketches before the data is sent to Druid. To support this
+behaviour, the sketches in the input dataset must be serialised to base64-encoded bytes. Then, in the native ingestion
+`MetricsSpec` the `HLLSketchMerge` must be specified for the input column as shown above.
+

Review Comment:
   ```suggestion
   You can use the `HLLSketchMerge` aggregator to ingest pre-generated sketches from an input dataset. For example, you can set up a batch processing job to generate the sketches before sending the data to Druid. You must serialize the sketches in the input dataset to Base64-encoded bytes. Then, specify `HLLSketchMerge` for the input column in the native ingestion `metricsSpec`.
   
   ```
   Stylistic suggestions. Also wonder if it might be helpful to have an example of the `MetricsSpec`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cloventt commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
cloventt commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r973784759


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -59,6 +59,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchBuild` aggregator builds a datasketch from the input column specified. If used during ingestion, this
+will result in Druid storing pre-generated HLL Sketch objects in the datasource, rather than the original value itself.
+If used at query time on an existing dimension, the resulting column can be used as an intermediate dimension by the
+post-aggregators below.
+

Review Comment:
   Style changes commited in df7d508.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cloventt commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
cloventt commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r973784759


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -59,6 +59,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchBuild` aggregator builds a datasketch from the input column specified. If used during ingestion, this
+will result in Druid storing pre-generated HLL Sketch objects in the datasource, rather than the original value itself.
+If used at query time on an existing dimension, the resulting column can be used as an intermediate dimension by the
+post-aggregators below.
+

Review Comment:
   Style changes commited in df7d508, thanks for the suggestion!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] FrankChen021 merged pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
FrankChen021 merged PR #13088:
URL: https://github.com/apache/druid/pull/13088


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r972530970


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -89,6 +94,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchMerge` aggregator can be used to ingest pre-generated sketches from an input dataset. For example, an
+earlier batch processing job can be used to generate the sketches before the data is sent to Druid. To support this
+behaviour, the sketches in the input dataset must be serialised to base64-encoded bytes. Then, in the native ingestion
+`MetricsSpec` the `HLLSketchMerge` must be specified for the input column as shown above.
+

Review Comment:
   ```suggestion
   You can use the `HLLSketchMerge` aggregator to ingest pre-generated sketches from an input dataset. For example, you can set up a batch processing job to generate the sketches before sending the data to Druid. You must serialize the sketches in the input dataset to Base64-encoded bytes. Then, specify `HLLSketchMerge` for the input column in the native ingestion `metricsSpec`.
   ```
   Stylistic suggestions. Also wonder if it might be helpful to have an example of the `MetricsSpec`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cloventt commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
cloventt commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r973784680


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -89,6 +94,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchMerge` aggregator can be used to ingest pre-generated sketches from an input dataset. For example, an
+earlier batch processing job can be used to generate the sketches before the data is sent to Druid. To support this
+behaviour, the sketches in the input dataset must be serialised to base64-encoded bytes. Then, in the native ingestion
+`MetricsSpec` the `HLLSketchMerge` must be specified for the input column as shown above.
+

Review Comment:
   I'm not sure on the usefulness of an example - the code blob to be added to the `MetricsSpec` is already included directly above this paragraph.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r972529744


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -59,6 +59,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchBuild` aggregator builds a datasketch from the input column specified. If used during ingestion, this
+will result in Druid storing pre-generated HLL Sketch objects in the datasource, rather than the original value itself.
+If used at query time on an existing dimension, the resulting column can be used as an intermediate dimension by the
+post-aggregators below.
+

Review Comment:
   ```suggestion
   The `HLLSketchBuild` aggregator builds an HLL sketch object from the specified input column. When used during ingestion, Druid stores pre-generated HLL sketch objects in the datasource instead of the original values.
   When applied at query time on an existing dimension, you can use the resulting column as an intermediate dimension by the [post-aggregators](#post-aggregators).
   
   ```
   Thanks for the clarification/ contribution @cloventt ! I've suggest some stylistic changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r973370116


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -59,6 +59,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchBuild` aggregator builds a datasketch from the input column specified. If used during ingestion, this
+will result in Druid storing pre-generated HLL Sketch objects in the datasource, rather than the original value itself.
+If used at query time on an existing dimension, the resulting column can be used as an intermediate dimension by the
+post-aggregators below.
+

Review Comment:
   That was the way I understood it. It stores the sketch/aggregation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vtlim commented on a diff in pull request #13088: Add a note to the documentation about pre-built HLLSketches

Posted by GitBox <gi...@apache.org>.
vtlim commented on code in PR #13088:
URL: https://github.com/apache/druid/pull/13088#discussion_r973372379


##########
docs/development/extensions-core/datasketches-hll.md:
##########
@@ -89,6 +94,11 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+The `HLLSketchMerge` aggregator can be used to ingest pre-generated sketches from an input dataset. For example, an
+earlier batch processing job can be used to generate the sketches before the data is sent to Druid. To support this
+behaviour, the sketches in the input dataset must be serialised to base64-encoded bytes. Then, in the native ingestion
+`MetricsSpec` the `HLLSketchMerge` must be specified for the input column as shown above.
+

Review Comment:
   Made a tiny bit of edits. I agree that an example could be helpful here!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org