You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "clintropolis (via GitHub)" <gi...@apache.org> on 2023/05/30 03:33:31 UTC

[GitHub] [druid] clintropolis opened a new pull request, #14351: switch to front-coded v1 bucket size 4 by default

clintropolis opened a new pull request, #14351:
URL: https://github.com/apache/druid/pull/14351

   ### Description
   I think we should consider switching the `IndexSpec` default value of `stringDictionaryEncoding` to `{"type":"frontCoded", "bucketSize":4, "formatVersion":1}`.
   
   Based on measurements #13854 things look pretty good and we have been running version 0 of the format for some time on a number of datasources without any notable performance loss, and version 1 for a smaller amount of time. I think by the time 27 is released it should be sufficiently baked in to feel confident about it being the default.
   
   However, this means that upgrading from versions older than 26 will need special consideration, so it is important to call out in the release notes if we go forward with this.
   
   #### Release note
   Front coding was originally introduced in Druid 25.0, and an improved 'version 1' was introduced in Druid 26.0, with typically faster read speed and smaller storage size, has become the default in Druid 27.0. This means by default, segments created with Druid 27.0 are backwards compatible with Druid 26.0, but not compatible with Druid versions older than 26.0. If upgrading to Druid 27.0 from a version older than 26.0, the `stringDictionaryEncoding` should be set to `{"type": "utf8"}` to keep writing out the older format to enable seamless downgrades to Druid 25.0 and older, and then later is recommended to be changed to the new default once determined that rollback is not necessary.
   
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] switch to front-coded v1 bucket size 4 by default (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14351:
URL: https://github.com/apache/druid/pull/14351#discussion_r1261675259


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -495,18 +495,18 @@ The `indexSpec` object can include the following properties:
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`.|`{"type": "roaring"}`|
 |dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for STRING value dictionaries used by STRING and COMPLEX&lt;json&gt; columns. <br /><br />Example to enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br />`bucketSize` is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4.<br /> `formatVersion` can specify older versions for backwards compatibility during rolling upgrades, valid options are `0` and `1`. Defaults to `0` for backwards compatibility.<br /><br />See [Front coding](#front-coding) for more information.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING value dictionaries used by STRING and COMPLEX&lt;json&gt; columns. <br /><br />Example to enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br />`bucketSize` is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4.<br /> `formatVersion` can specify older versions for backwards compatibility during rolling upgrades, valid options are `0` and `1`, defaults to `1`.<br /><br />See [Front coding](#front-coding) for more information.|`{"type":"frontCoded", "bucketSize": 4, "formatVersion": 1}`|

Review Comment:
   ```suggestion
   |stringDictionaryEncoding|Encoding format for STRING value dictionaries used by STRING and COMPLEX&lt;json&gt; columns. <br /><br />Example to enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br />`bucketSize` is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4.<br /> `formatVersion` can specify older versions for backwards compatibility during rolling upgrades. Valid options are 0 and 1. Defaults to 1.<br /><br />See [Front coding](#front-coding) for more information.|`{"type":"frontCoded", "bucketSize": 4, "formatVersion": 1}`|
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] switch to front-coded v1 bucket size 4 by default (druid)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #14351:
URL: https://github.com/apache/druid/pull/14351#issuecomment-1989680327

   This pull request/issue has been closed due to lack of activity. If you think that
   is incorrect, or the pull request requires review, you can revive the PR at any time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] switch to front-coded v1 bucket size 4 by default (druid)

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14351:
URL: https://github.com/apache/druid/pull/14351#discussion_r1261680152


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -495,18 +495,18 @@ The `indexSpec` object can include the following properties:
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`.|`{"type": "roaring"}`|
 |dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for STRING value dictionaries used by STRING and COMPLEX&lt;json&gt; columns. <br /><br />Example to enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br />`bucketSize` is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4.<br /> `formatVersion` can specify older versions for backwards compatibility during rolling upgrades, valid options are `0` and `1`. Defaults to `0` for backwards compatibility.<br /><br />See [Front coding](#front-coding) for more information.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING value dictionaries used by STRING and COMPLEX&lt;json&gt; columns. <br /><br />Example to enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br />`bucketSize` is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4.<br /> `formatVersion` can specify older versions for backwards compatibility during rolling upgrades, valid options are `0` and `1`, defaults to `1`.<br /><br />See [Front coding](#front-coding) for more information.|`{"type":"frontCoded", "bucketSize": 4, "formatVersion": 1}`|
 |metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
 |longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
 |jsonCompression|Compression format to use for nested column raw data. Options are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
 
 ##### Front coding
 
-Front coding is an experimental feature starting in version 25.0. Front coding is an incremental encoding strategy that Druid can use to store STRING and [COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns. It allows Druid to create smaller UTF-8 encoded segments with very little performance cost.
+Front coding is an incremental encoding strategy that Druid uses by default to store STRING and [COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns. It allows Druid to create smaller UTF-8 encoded segments with very little performance cost.
 
-You can enable front coding with all types of ingestion. For information on defining an `indexSpec` in a query context, see [SQL-based ingestion reference](../multi-stage-query/reference.md#context-parameters).
+For information on defining an `indexSpec` in a query context, see [SQL-based ingestion reference](../multi-stage-query/reference.md#context-parameters).
 
-> Front coding was originally introduced in Druid 25.0, and an improved 'version 1' was introduced in Druid 26.0, with typically faster read speed and smaller storage size. The current recommendation is to enable it in a staging environment and fully test your use case before using in production. By default, segments created with front coding enabled in Druid 26.0 are backwards compatible with Druid 25.0, but those created with Druid 26.0 or 25.0 are not compatible with Druid versions older than 25.0. If using front coding in Druid 25.0 and upgrading to Druid 26.0, the `formatVersion` defaults to `0` to keep writing out the older format to enable seamless downgrades to Druid 25.0, and then later is recommended to be changed to `1` once determined that rollback is not necessary.
+> Front coding was originally introduced in Druid 25.0, and an improved 'version 1' was introduced in Druid 26.0, with typically faster read speed and smaller storage size, before finally becoming the default in Druid 27.0. By default, segments created with Druid 27.0 are backwards compatible with Druid 26.0, but not compatible with Druid versions older than 26.0. If upgrading to Druid 27.0 from a version older than 26.0, the `stringDictionaryEncoding` should be set to `{"type": "utf8"}` to keep writing out the older format to enable seamless downgrades to Druid 25.0 and older, and then later is recommended to be changed to the new default once determined that rollback is not necessary.

Review Comment:
   ```suggestion
   > Front coding was originally introduced in Druid 25.0. Then, an improved 'version 1' was introduced in Druid 26.0, with typically faster read speed and smaller storage size, before finally becoming the default in Druid 27.0. By default, segments created with Druid 27.0 are backwards compatible with Druid 26.0, but not compatible with Druid versions older than 26.0. If upgrading to Druid 27.0 from a version older than 26.0, set the `stringDictionaryEncoding` to `{"type": "utf8"}` to keep writing out the older format to enable seamless downgrades to Druid 25.0 and older, and then later is recommended to be changed to the new default once determined that rollback is not necessary.
   ```
   This part is a bit confusing:
   "...and then later is recommended to be changed to the new default once determined that rollback is not necessary."
   Are we recommending that users set `stringDictionaryEncoding` to front coding?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] switch to front-coded v1 bucket size 4 by default (druid)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #14351:
URL: https://github.com/apache/druid/pull/14351#issuecomment-1937927833

   This pull request has been marked as stale due to 60 days of inactivity.
   It will be closed in 4 weeks if no further activity occurs. If you think
   that's incorrect or this pull request should instead be reviewed, please simply
   write any comment. Even if closed, you can still revive the PR at any time or
   discuss it on the dev@druid.apache.org list.
   Thank you for your contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] switch to front-coded v1 bucket size 4 by default (druid)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #14351: switch to front-coded v1 bucket size 4 by default
URL: https://github.com/apache/druid/pull/14351


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] switch to front-coded v1 bucket size 4 by default (druid)

Posted by "gianm (via GitHub)" <gi...@apache.org>.
gianm commented on PR #14351:
URL: https://github.com/apache/druid/pull/14351#issuecomment-1616842399

   People may find it difficult to update all their ingest specs, supervisors, etc, to ensure backwards compatibility. Is it possible to also add a runtime property that controls the default? That way, a cluster admin only has to set it in one place rather than track down all their users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org