You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by "amaechler (via GitHub)" <gi...@apache.org> on 2023/05/30 23:54:10 UTC

[GitHub] [druid] amaechler opened a new pull request, #14356: Minor cleanups from working on sampling bugfix

amaechler opened a new pull request, #14356:
URL: https://github.com/apache/druid/pull/14356

   ### Description
   
   While working on #14355, I came across a bunch of typos, spacing inconsistencies, etc. I pulled those non-functional changes out into a separate PR.
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1210994575


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,7 +284,7 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|

Review Comment:
   Could you please indicate what has changed in this line? The diff doesn't show it clearly. The only change I could find is `_DROPPED_` to `*DROPPED*` which is equivalent markdown afaik. If that is the case, best revert this change as it affects neither readability nor the rendered HTML.



##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -352,14 +353,14 @@ look for credentials set in environment variables, via [Web Identity Token](http
 profile provider (in this order).
 
 To ingest data from Kinesis, ensure that the policy attached to your IAM role contains the necessary permissions.
-The permissions needed depend on the value of `useListShards`.
+The required permissions depend on the value of `useListShards`.
 
 If the `useListShards` flag is set to `true`, you need following permissions:
 
-* `ListStreams`: required to list your data streams
-* `Get*`: required for `GetShardIterator`
-* `GetRecords`: required to get data records from a data stream's shard
-* `ListShards` : required to get the shards for a stream of interest
+- `ListStreams`: required to list your data streams

Review Comment:
   Both `*` and `-` can be used for lists. We can probably skip this change. Other places in the docs might be using `*` for lists too, so I don't think consistency is a real requirement here.



##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -380,14 +381,14 @@ If the `useListShards` flag is set to `true`, you need following permissions:
 
 If the `useListShards` flag is set to `false`, you need following permissions:
 
-* `ListStreams`: required to list your data streams
-* `Get*`: required for `GetShardIterator`
-* `GetRecords`: required to get data records from a data stream's shard
-* `DescribeStream`: required to describe the specified data stream
+- `ListStreams`: required to list your data streams
+- `Get*`: required for `GetShardIterator`
+- `GetRecords`: required to get data records from a data stream's shard
+- `DescribeStream`: required to describe the specified data stream
 
 **Example policy**
 
-```    
+```

Review Comment:
   Use ` ```json ` maybe?



##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -302,7 +302,7 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`maxParseExceptions`|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|no, unlimited default|
 |`maxSavedParseExceptions`|Integer|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../../ingestion/tasks.md#task-reports). Overridden if `reportParseExceptions` is set.|no, default == 0|
 |`maxRecordsPerPoll`|Integer|The maximum number of records/events to be fetched from buffer per poll. The actual maximum will be `Max(maxRecordsPerPoll, Max(bufferSize, 1))`|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)|
-|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid the issues with empty shard handling described at https://github.com/apache/druid/issues/7600.|no, (default == PT2M)|
+|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid the issues with empty shard handling described at <https://github.com/apache/druid/issues/7600>.|no, (default == PT2M)|

Review Comment:
   ```suggestion
   |`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid the issues with empty shard handling described [in this issue](https://github.com/apache/druid/issues/7600).|no, (default == PT2M)|
   ```
   
   Maybe use some text which is hyperlinked to the url.



##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -477,25 +479,25 @@ it will just ensure that no indexing tasks are running until the supervisor is r
 
 ### Resetting Supervisors
 
-The `POST /druid/indexer/v1/supervisor/<supervisorId>/reset` operation clears stored 
-sequence numbers, causing the supervisor to start reading from either the earliest or 
-latest sequence numbers in Kinesis (depending on the value of `useEarliestSequenceNumber`). 
-After clearing stored sequence numbers, the supervisor kills and recreates active tasks, 
-so that tasks begin reading from valid sequence numbers. 
+The `POST /druid/indexer/v1/supervisor/<supervisorId>/reset` operation clears stored

Review Comment:
   I would suggest leaving out this change as it is only removing trailing spaces. It would be better to do this whenever we update the content in this section.



##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -635,10 +639,11 @@ To enable this feature, set `deaggregate` to true in your `ioConfig` when submit
 
 When changing the shard count for a Kinesis stream, there will be a window of time around the resharding operation with early shutdown of Kinesis ingestion tasks and possible task failures.
 
-The early shutdowns and task failures are expected, and they occur because the supervisor will update the shard -> task group mappings as shards are closed and fully read, to ensure that tasks are not running 
-with an assignment of closed shards that have been fully read and to ensure a balanced distribution of active shards across tasks. 
+The early shutdowns and task failures are expected, and they occur because the supervisor will update the shard -> task group mappings as shards are closed and fully read, to ensure that tasks are not running

Review Comment:
   Same comment about trailing spaces.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213441954


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -572,22 +574,23 @@ In this way, configuration changes can be applied without requiring any pause in
 #### On the Subject of Segments
 
 Each Kinesis Indexing Task puts events consumed from Kinesis Shards assigned to it in a single segment for each segment
-granular interval until maxRowsPerSegment, maxTotalRows or intermediateHandoffPeriod limit is reached, at this point a new shard
+granular interval until maxRowsPerSegment, maxTotalRows or intermediateHandoffPeriod limit is reached. At this point, a new shard
 for this segment granularity is created for further events. Kinesis Indexing Task also does incremental hand-offs which
 means that all the segments created by a task will not be held up till the task duration is over. As soon as maxRowsPerSegment,
 maxTotalRows or intermediateHandoffPeriod limit is hit, all the segments held by the task at that point in time will be handed-off
 and new set of segments will be created for further events. This means that the task can run for longer durations of time
-without accumulating old segments locally on Middle Manager processes and it is encouraged to do so.
+without accumulating old segments locally on Middle Manager processes, and it is encouraged to do so.
 
-Kinesis Indexing Service may still produce some small segments. Lets say the task duration is 4 hours, segment granularity
-is set to an HOUR and Supervisor was started at 9:10 then after 4 hours at 13:10, new set of tasks will be started and
-events for the interval 13:00 - 14:00 may be split across previous and new set of tasks. If you see it becoming a problem then
+Kinesis Indexing Service may still produce some small segments. Let's say the task duration is 4 hours, segment granularity
+is set to an HOUR and Supervisor was started at 9:10 then after 4 hours at 13:10. The new set of tasks will be started and
+events for the interval 13:00 - 14:00 may be split across a previous and new set of tasks. If you see it becoming a problem then

Review Comment:
   ```suggestion
   events for the interval 13:00 - 14:00 may be split across the previous and the new set of tasks. If you see it becoming a problem then
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on PR #14356:
URL: https://github.com/apache/druid/pull/14356#issuecomment-1572150565

   @kfaraz Reverted the improvements in the non-markdown files.
   
   (/cc @ektravel Do you mind taking a quick peek as well?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] vtlim commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "vtlim (via GitHub)" <gi...@apache.org>.

vtlim commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213516645


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -256,6 +255,7 @@ Kinesis indexing service supports both [`inputFormat`](../../ingestion/data-form
 Use the `inputFormat` to specify the data format for Kinesis indexing service unless you need a format only supported by the legacy `parser`.
 
 Supported `inputFormat`s include:

Review Comment:
   ```suggestion
   Supported values for `inputFormat` include:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213354247


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -577,9 +579,9 @@ for this segment granularity is created for further events. Kinesis Indexing Tas
 means that all the segments created by a task will not be held up till the task duration is over. As soon as maxRowsPerSegment,
 maxTotalRows or intermediateHandoffPeriod limit is hit, all the segments held by the task at that point in time will be handed-off
 and new set of segments will be created for further events. This means that the task can run for longer durations of time
-without accumulating old segments locally on Middle Manager processes and it is encouraged to do so.
+without accumulating old segments locally on Middle Manager processes, and it is encouraged to do so.
 
-Kinesis Indexing Service may still produce some small segments. Lets say the task duration is 4 hours, segment granularity
+Kinesis Indexing Service may still produce some small segments. Let's say the task duration is 4 hours, segment granularity
 is set to an HOUR and Supervisor was started at 9:10 then after 4 hours at 13:10, new set of tasks will be started and

Review Comment:
   ```suggestion
   is set to an HOUR and Supervisor was started at 9:10 then after 4 hours at 13:10. The new set of tasks will be started and
   ```
   Consider breaking this long sentence into two to make it easier to read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on PR #14356:
URL: https://github.com/apache/druid/pull/14356#issuecomment-1572439519

   @amaechler , sorry,  I had missed a couple of minor suggestions. Just added these.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1211031544


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -477,25 +479,25 @@ it will just ensure that no indexing tasks are running until the supervisor is r
 
 ### Resetting Supervisors
 
-The `POST /druid/indexer/v1/supervisor/<supervisorId>/reset` operation clears stored 
-sequence numbers, causing the supervisor to start reading from either the earliest or 
-latest sequence numbers in Kinesis (depending on the value of `useEarliestSequenceNumber`). 
-After clearing stored sequence numbers, the supervisor kills and recreates active tasks, 
-so that tasks begin reading from valid sequence numbers. 
+The `POST /druid/indexer/v1/supervisor/<supervisorId>/reset` operation clears stored

Review Comment:
   The next person might forget though, so wouldn't this set a good precedent (for cleanup)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213357065


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -635,10 +639,11 @@ To enable this feature, set `deaggregate` to true in your `ioConfig` when submit
 
 When changing the shard count for a Kinesis stream, there will be a window of time around the resharding operation with early shutdown of Kinesis ingestion tasks and possible task failures.
 
-The early shutdowns and task failures are expected, and they occur because the supervisor will update the shard -> task group mappings as shards are closed and fully read, to ensure that tasks are not running 
-with an assignment of closed shards that have been fully read and to ensure a balanced distribution of active shards across tasks. 
+The early shutdowns and task failures are expected, and they occur because the supervisor will update the shard -> task group mappings as shards are closed and fully read, to ensure that tasks are not running
+with an assignment of closed shards that have been fully read and to ensure a balanced distribution of active shards across tasks.

Review Comment:
   ```suggestion
   with an assignment of closed shards that have been fully read and to ensure a balanced distribution of active shards across tasks.
   ```
   ```suggestion
   with an assignment of closed shards that have been fully read and balances distribution of active shards across tasks.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on PR #14356:
URL: https://github.com/apache/druid/pull/14356#issuecomment-1572116624

   Heya @kfaraz, thank you for your detailed thoughts. Alright, let me focus this PR on updating the Markdown doc only, and include some of the cleanups with #14355 (in case I touch that code).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1211031976


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -380,14 +381,14 @@ If the `useListShards` flag is set to `true`, you need following permissions:
 
 If the `useListShards` flag is set to `false`, you need following permissions:
 
-* `ListStreams`: required to list your data streams
-* `Get*`: required for `GetShardIterator`
-* `GetRecords`: required to get data records from a data stream's shard
-* `DescribeStream`: required to describe the specified data stream
+- `ListStreams`: required to list your data streams
+- `Get*`: required for `GetShardIterator`
+- `GetRecords`: required to get data records from a data stream's shard
+- `DescribeStream`: required to describe the specified data stream
 
 **Example policy**
 
-```    
+```

Review Comment:
   Good idea, updated this and another instance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1211030222


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -302,7 +302,7 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`maxParseExceptions`|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|no, unlimited default|
 |`maxSavedParseExceptions`|Integer|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../../ingestion/tasks.md#task-reports). Overridden if `reportParseExceptions` is set.|no, default == 0|
 |`maxRecordsPerPoll`|Integer|The maximum number of records/events to be fetched from buffer per poll. The actual maximum will be `Max(maxRecordsPerPoll, Max(bufferSize, 1))`|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)|
-|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid the issues with empty shard handling described at https://github.com/apache/druid/issues/7600.|no, (default == PT2M)|
+|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid the issues with empty shard handling described at <https://github.com/apache/druid/issues/7600>.|no, (default == PT2M)|

Review Comment:
   Good idea, updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on PR #14356:
URL: https://github.com/apache/druid/pull/14356#issuecomment-1569437726

   > Do you think the benefit of a slightly cleaner git history outweighs leaving non-functional cleanups like these up for another day? 
   
   Yeah, I am torn about it at times myself 🤷🏻 .
   
   > Personally, I feel any updates similar to this one benefit future developers more than a clean history,
   
   I would have to disagree on the benefit to developers aspect as anyone reading a typo can immediately make out the correct alternative. Had we been _rephrasing_ the javadocs to use better language or adding new info, it could have been beneficial.
   
   > we should encourage anyone to make similar updates. After all, what are the chances that the next person touching the codebase will catch the exact same thing?
   
   I agree that we should encourage these improvements, however small. I personally do tend to include these with my own changes. 
   
   But I prefer a slightly different approach of going about them. I would rather contributors (and even reviewers) read the comments, javadocs, logs, exceptions in and around the code which is being touched in the PR at hand. This practice ensures that any outdated info is updated (which is often the case with comments and javadocs) and is bound to catch any formatting/typo issues too as an added benefit.
   
   Let me know what you think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213438038


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -624,7 +624,7 @@ If the above limits are exceeded, Kinesis throws ProvisionedThroughputExceededEx
 Kinesis tasks pause by `fetchDelayMillis` or 3 seconds, whichever is larger, and then attempt the call again.
 
 In most cases, the default settings for fetch parameters are sufficient to achieve good performance without excessive
-memory usage. However, in some cases, you may need to adjust these parameters in order to more finely control fetch rate
+memory usage. However, in some cases, you may need to adjust these parameters to more finely control fetch rate

Review Comment:
   ```suggestion
   memory usage. However, in some cases, you may need to adjust these parameters to control fetch rate
   and memory usage more finely.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213441716


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -572,22 +574,23 @@ In this way, configuration changes can be applied without requiring any pause in
 #### On the Subject of Segments
 
 Each Kinesis Indexing Task puts events consumed from Kinesis Shards assigned to it in a single segment for each segment
-granular interval until maxRowsPerSegment, maxTotalRows or intermediateHandoffPeriod limit is reached, at this point a new shard
+granular interval until maxRowsPerSegment, maxTotalRows or intermediateHandoffPeriod limit is reached. At this point, a new shard
 for this segment granularity is created for further events. Kinesis Indexing Task also does incremental hand-offs which
 means that all the segments created by a task will not be held up till the task duration is over. As soon as maxRowsPerSegment,
 maxTotalRows or intermediateHandoffPeriod limit is hit, all the segments held by the task at that point in time will be handed-off
 and new set of segments will be created for further events. This means that the task can run for longer durations of time
-without accumulating old segments locally on Middle Manager processes and it is encouraged to do so.
+without accumulating old segments locally on Middle Manager processes, and it is encouraged to do so.
 
-Kinesis Indexing Service may still produce some small segments. Lets say the task duration is 4 hours, segment granularity
-is set to an HOUR and Supervisor was started at 9:10 then after 4 hours at 13:10, new set of tasks will be started and
-events for the interval 13:00 - 14:00 may be split across previous and new set of tasks. If you see it becoming a problem then
+Kinesis Indexing Service may still produce some small segments. Let's say the task duration is 4 hours, segment granularity
+is set to an HOUR and Supervisor was started at 9:10 then after 4 hours at 13:10. The new set of tasks will be started and

Review Comment:
   The period should be right before then.
   
   ```suggestion
   is set to an HOUR and Supervisor was started at 9:10. Then after 4 hours at 13:10, the new set of tasks will be started and
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213500559


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -557,7 +557,7 @@ fail-overs.
 A supervisor is stopped via the `POST /druid/indexer/v1/supervisor/<supervisorId>/terminate` endpoint. This places a
 tombstone marker in the database (to prevent the supervisor from being reloaded on a restart) and then gracefully
 shuts down the currently running supervisor. When a supervisor is shut down in this way, it will instruct its
-managed tasks to stop reading and begin publishing their segments immediately. The call to the shutdown endpoint will
+managed tasks to stop reading and begin publishing its segments immediately. The call to the shutdown endpoint will

Review Comment:
   Thanks! I split it into to phrases to make that clear :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on PR #14356:
URL: https://github.com/apache/druid/pull/14356#issuecomment-1573059484

   Reading through this PR, I realized that the whole of kinesis ingestion doc can do with a bit of language and structure revamp (e.g. adding a separate column for default values).
   
   But we can do that in a separate PR. @ektravel , would you be willing to take that up?
   
   Merging these changes for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213345582


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,25 +284,25 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earliest or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful 
 for non-production situations since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
 |`skipSequenceNumberAvailabilityCheck`|Boolean|Whether to enable checking if the current sequence number is still available in a particular Kinesis shard. If set to false, the indexing task will attempt to reset the current sequence number (or not), depending on the value of `resetOffsetAutomatically`.|no (default == false)|
 |`workerThreads`|Integer|The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation.|no (default == min(10, taskCount))|
-|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                | no (default == true)                                                                |
+|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.| no (default == true)|
 |`chatThreads`|Integer| The number of threads that will be used for communicating with indexing tasks. Ignored if `chatAsync` is `true` (the default).| no (default == min(10, taskCount * replicas))|
 |`chatRetries`|Integer|The number of times HTTP requests to indexing tasks will be retried before considering tasks unresponsive.| no (default == 8)|
 |`httpTimeout`|ISO8601 Period|How long to wait for a HTTP response from an indexing task.|no (default == PT10S)|
 |`shutdownTimeout`|ISO8601 Period|How long to wait for the supervisor to attempt a graceful shutdown of tasks before exiting.|no (default == PT80S)|
 |`recordBufferSize`|Integer|Size of the buffer (number of events) used between the Kinesis fetch threads and the main ingestion thread.|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)|
 |`recordBufferOfferTimeout`|Integer|Length of time in milliseconds to wait for space to become available in the buffer before timing out.| no (default == 5000)|
 |`recordBufferFullWait`|Integer|Length of time in milliseconds to wait for the buffer to drain before attempting to fetch records from Kinesis again.|no (default == 5000)|
-|`fetchThreads`|Integer|Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards.|no (default == procs * 2, where "procs" is the number of processors available to the task)                                                                  |
+|`fetchThreads`|Integer|Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards.|no (default == procs * 2, where "procs" is the number of processors available to the task)|
 |`segmentWriteOutMediumFactory`|Object|Segment write-out medium to use when creating segments. See below for more information.|no (not specified by default, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used)|
 |`intermediateHandoffPeriod`|ISO8601 Period|How often the tasks should hand off segments. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.| no (default == P2147483647D)|
 |`logParseExceptions`|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no, default == false|
 |`maxParseExceptions`|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|no, unlimited default|
 |`maxSavedParseExceptions`|Integer|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../../ingestion/tasks.md#task-reports). Overridden if `reportParseExceptions` is set.|no, default == 0|
 |`maxRecordsPerPoll`|Integer|The maximum number of records/events to be fetched from buffer per poll. The actual maximum will be `Max(maxRecordsPerPoll, Max(bufferSize, 1))`|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)|
-|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid the issues with empty shard handling described at https://github.com/apache/druid/issues/7600.|no, (default == PT2M)|
+|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid issues with [empty shard handling](https://github.com/apache/druid/issues/7600).|no, (default == PT2M)|

Review Comment:
   ```suggestion
   |`repartitionTransitionDuration`|ISO8601 period|When shards are split or merged, the supervisor recomputes shard to task group mappings. The supervisor also signals any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split or merge, which helps avoid issues with [empty shard handling](https://github.com/apache/druid/issues/7600).|no, (default == PT2M)|
   ```
   Removed future tense. 
   Consider removing parenthesis from "(current time + `repartitionTransitionDuration`)". Maybe rewrite that sentence so that there is no need to use parenthesis.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213336850


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,25 +284,25 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earliest or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful 
 for non-production situations since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
 |`skipSequenceNumberAvailabilityCheck`|Boolean|Whether to enable checking if the current sequence number is still available in a particular Kinesis shard. If set to false, the indexing task will attempt to reset the current sequence number (or not), depending on the value of `resetOffsetAutomatically`.|no (default == false)|
 |`workerThreads`|Integer|The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation.|no (default == min(10, taskCount))|
-|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                | no (default == true)                                                                |
+|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.| no (default == true)|

Review Comment:
   ```suggestion
   |`chatAsync`|Boolean| If true, the supervisor uses asynchronous communication with indexing tasks and ignores the `chatThreads` parameter. If false, the supervisor uses synchronous communication in a thread pool of size `chatThreads`.| no (default == true)|
   ```
   I am assuming this parameter controls the behavior of the supervisor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1211028899


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,7 +284,7 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|

Review Comment:
   This change has been done automatically by a [markdownlint](https://marketplace.visualstudio.com/items?itemName=DavidAnson.vscode-markdownlint) extension I used, specifically [`MD049`](https://github.com/DavidAnson/markdownlint/blob/v0.28.2/doc/md049.md) for consistent emphasis style. You're correct that it doesn't render any different Markdown, but overall consistency is probably not a bad thing, wouldn't you agree?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on PR #14356:
URL: https://github.com/apache/druid/pull/14356#issuecomment-1569423897

   Heya @kfaraz, thanks for your review! 🙇
   
   > As for the java code, your changes do improve readability but it is still better to leave these changes to the next person touching these files for functional improvements. If you are updating the behaviour of these classes in any of your other PRs, you can include these there.
   
   Do you think the benefit of a slightly cleaner git history outweighs leaving non-functional cleanups like these up for another day? Personally, I feel any updates similar to this one benefit future developers more than a clean history, and we should encourage anyone to make similar updates. After all, what are the chances that the next person touching the codebase will catch the exact same thing?
   
   However, I of course can easily revert any of the Java comments if you feel strongly about this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz merged pull request #14356: Docs: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz merged PR #14356:
URL: https://github.com/apache/druid/pull/14356


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213328784


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,25 +284,25 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earliest or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful 
 for non-production situations since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|

Review Comment:
   ```suggestion
   |`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception bubbles up, causing tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it highlights issues with ingestion.<br/><br/>If true, Druid automatically resets to the earliest or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Druid will log messages indicating that a reset has occurred without interrupting ingestion. This mode is useful for non-production situations since i
 t enables Druid to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
   ```
   Removed most of the future tense and shortened some of the sentences to help with readability. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213345897


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,25 +284,25 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earliest or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful 
 for non-production situations since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
 |`skipSequenceNumberAvailabilityCheck`|Boolean|Whether to enable checking if the current sequence number is still available in a particular Kinesis shard. If set to false, the indexing task will attempt to reset the current sequence number (or not), depending on the value of `resetOffsetAutomatically`.|no (default == false)|
 |`workerThreads`|Integer|The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation.|no (default == min(10, taskCount))|
-|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                | no (default == true)                                                                |
+|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.| no (default == true)|
 |`chatThreads`|Integer| The number of threads that will be used for communicating with indexing tasks. Ignored if `chatAsync` is `true` (the default).| no (default == min(10, taskCount * replicas))|
 |`chatRetries`|Integer|The number of times HTTP requests to indexing tasks will be retried before considering tasks unresponsive.| no (default == 8)|
 |`httpTimeout`|ISO8601 Period|How long to wait for a HTTP response from an indexing task.|no (default == PT10S)|
 |`shutdownTimeout`|ISO8601 Period|How long to wait for the supervisor to attempt a graceful shutdown of tasks before exiting.|no (default == PT80S)|
 |`recordBufferSize`|Integer|Size of the buffer (number of events) used between the Kinesis fetch threads and the main ingestion thread.|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)|
 |`recordBufferOfferTimeout`|Integer|Length of time in milliseconds to wait for space to become available in the buffer before timing out.| no (default == 5000)|
 |`recordBufferFullWait`|Integer|Length of time in milliseconds to wait for the buffer to drain before attempting to fetch records from Kinesis again.|no (default == 5000)|
-|`fetchThreads`|Integer|Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards.|no (default == procs * 2, where "procs" is the number of processors available to the task)                                                                  |
+|`fetchThreads`|Integer|Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards.|no (default == procs * 2, where "procs" is the number of processors available to the task)|
 |`segmentWriteOutMediumFactory`|Object|Segment write-out medium to use when creating segments. See below for more information.|no (not specified by default, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used)|
 |`intermediateHandoffPeriod`|ISO8601 Period|How often the tasks should hand off segments. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.| no (default == P2147483647D)|
 |`logParseExceptions`|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no, default == false|
 |`maxParseExceptions`|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|no, unlimited default|
 |`maxSavedParseExceptions`|Integer|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../../ingestion/tasks.md#task-reports). Overridden if `reportParseExceptions` is set.|no, default == 0|
 |`maxRecordsPerPoll`|Integer|The maximum number of records/events to be fetched from buffer per poll. The actual maximum will be `Max(maxRecordsPerPoll, Max(bufferSize, 1))`|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)|
-|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid the issues with empty shard handling described at https://github.com/apache/druid/issues/7600.|no, (default == PT2M)|
+|`repartitionTransitionDuration`|ISO8601 Period|When shards are split or merged, the supervisor will recompute shard -> task group mappings, and signal any running tasks created under the old mappings to stop early at (current time + `repartitionTransitionDuration`). Stopping the tasks early allows Druid to begin reading from the new shards more quickly. The repartition transition wait time controlled by this property gives the stream additional time to write records to the new shards after the split/merge, which helps avoid issues with [empty shard handling](https://github.com/apache/druid/issues/7600).|no, (default == PT2M)|
 |`offsetFetchPeriod`|ISO8601 Period|How often the supervisor queries Kinesis and the indexing tasks to fetch current offsets and calculate lag. If the user-specified value is below the minimum value (`PT5S`), the supervisor ignores the value and uses the minimum value instead.|no (default == PT30S, min == PT5S)|

Review Comment:
   ```suggestion
   |`offsetFetchPeriod`|ISO8601 period|How often the supervisor queries Kinesis and the indexing tasks to fetch current offsets and calculate lag. If the user-specified value is below the minimum value (`PT5S`), the supervisor ignores the value and uses the minimum value instead.|no (default == PT30S, min == PT5S)|
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on pull request #14356: Docs: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on PR #14356:
URL: https://github.com/apache/druid/pull/14356#issuecomment-1574038136

   @kfaraz I will create a separate PR to revamp the kinesis ingestion doc. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213356358


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -635,10 +639,11 @@ To enable this feature, set `deaggregate` to true in your `ioConfig` when submit
 
 When changing the shard count for a Kinesis stream, there will be a window of time around the resharding operation with early shutdown of Kinesis ingestion tasks and possible task failures.
 
-The early shutdowns and task failures are expected, and they occur because the supervisor will update the shard -> task group mappings as shards are closed and fully read, to ensure that tasks are not running 
-with an assignment of closed shards that have been fully read and to ensure a balanced distribution of active shards across tasks. 
+The early shutdowns and task failures are expected, and they occur because the supervisor will update the shard -> task group mappings as shards are closed and fully read, to ensure that tasks are not running

Review Comment:
   ```suggestion
   The early shutdowns and task failures are expected. They occur because the supervisor updates the shard to task group mappings as shards are closed and fully read. This ensures that tasks are not running
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213436118


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -557,7 +557,7 @@ fail-overs.
 A supervisor is stopped via the `POST /druid/indexer/v1/supervisor/<supervisorId>/terminate` endpoint. This places a
 tombstone marker in the database (to prevent the supervisor from being reloaded on a restart) and then gracefully
 shuts down the currently running supervisor. When a supervisor is shut down in this way, it will instruct its
-managed tasks to stop reading and begin publishing their segments immediately. The call to the shutdown endpoint will
+managed tasks to stop reading and begin publishing its segments immediately. The call to the shutdown endpoint will

Review Comment:
   "publishing _their_ segments" was the correct version, I believe. The tasks are supposed to publish segments, not the supervisor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213857782


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -136,7 +134,7 @@ Where the file `supervisor-spec.json` contains a Kinesis supervisor spec:
 |`period`|ISO8601 Period|How often the supervisor will execute its management logic. Note that the supervisor will also run in response to certain events (such as tasks succeeding, failing, and reaching their taskDuration) so this value specifies the maximum time between iterations.|no (default == PT30S)|
 |`useEarliestSequenceNumber`|Boolean|If a supervisor is managing a dataSource for the first time, it will obtain a set of starting sequence numbers from Kinesis. This flag determines whether it retrieves the earliest or latest sequence numbers in Kinesis. Under normal circumstances, subsequent tasks will start from where the previous segments ended so this flag will only be used on first run.|no (default == false)|
 |`completionTimeout`|ISO8601 Period|The length of time to wait before declaring a publishing task as failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after `taskDuration` elapses.|no (default == PT6H)|
-|`lateMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps earlier than this period before the task was created; for example if this is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* will be dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a realtime and a nightly batch ingestion pipeline).|no (default == none)|
+|`lateMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps earlier than this period before the task was created; for example if this is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* will be dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a streaming and a nightly batch ingestion pipeline).|no (default == none)|

Review Comment:
   👍🏻 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] ektravel commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "ektravel (via GitHub)" <gi...@apache.org>.

ektravel commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1213338418


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,25 +284,25 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earliest or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful 
 for non-production situations since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
 |`skipSequenceNumberAvailabilityCheck`|Boolean|Whether to enable checking if the current sequence number is still available in a particular Kinesis shard. If set to false, the indexing task will attempt to reset the current sequence number (or not), depending on the value of `resetOffsetAutomatically`.|no (default == false)|
 |`workerThreads`|Integer|The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation.|no (default == min(10, taskCount))|
-|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                | no (default == true)                                                                |
+|`chatAsync`|Boolean| If true, use asynchronous communication with indexing tasks, and ignore the `chatThreads` parameter. If false, use synchronous communication in a thread pool of size `chatThreads`.| no (default == true)|
 |`chatThreads`|Integer| The number of threads that will be used for communicating with indexing tasks. Ignored if `chatAsync` is `true` (the default).| no (default == min(10, taskCount * replicas))|
 |`chatRetries`|Integer|The number of times HTTP requests to indexing tasks will be retried before considering tasks unresponsive.| no (default == 8)|
 |`httpTimeout`|ISO8601 Period|How long to wait for a HTTP response from an indexing task.|no (default == PT10S)|
 |`shutdownTimeout`|ISO8601 Period|How long to wait for the supervisor to attempt a graceful shutdown of tasks before exiting.|no (default == PT80S)|
 |`recordBufferSize`|Integer|Size of the buffer (number of events) used between the Kinesis fetch threads and the main ingestion thread.|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)|
 |`recordBufferOfferTimeout`|Integer|Length of time in milliseconds to wait for space to become available in the buffer before timing out.| no (default == 5000)|
 |`recordBufferFullWait`|Integer|Length of time in milliseconds to wait for the buffer to drain before attempting to fetch records from Kinesis again.|no (default == 5000)|
-|`fetchThreads`|Integer|Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards.|no (default == procs * 2, where "procs" is the number of processors available to the task)                                                                  |
+|`fetchThreads`|Integer|Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards.|no (default == procs * 2, where "procs" is the number of processors available to the task)|

Review Comment:
   ```suggestion
   |`fetchThreads`|Integer|Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards.|no (default == procs * 2, where `procs` is the number of processors available to the task)|
   ```
   Should `procs` be in code font?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] amaechler commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "amaechler (via GitHub)" <gi...@apache.org>.

amaechler commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1211031042


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -352,14 +353,14 @@ look for credentials set in environment variables, via [Web Identity Token](http
 profile provider (in this order).
 
 To ingest data from Kinesis, ensure that the policy attached to your IAM role contains the necessary permissions.
-The permissions needed depend on the value of `useListShards`.
+The required permissions depend on the value of `useListShards`.
 
 If the `useListShards` flag is set to `true`, you need following permissions:
 
-* `ListStreams`: required to list your data streams
-* `Get*`: required for `GetShardIterator`
-* `GetRecords`: required to get data records from a data stream's shard
-* `ListShards` : required to get the shards for a stream of interest
+- `ListStreams`: required to list your data streams

Review Comment:
   Yeah, similarly to above tooling updated according to [MD004](https://github.com/DavidAnson/markdownlint/blob/v0.28.2/doc/md004.md).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] kfaraz commented on a diff in pull request #14356: Minor cleanups from working on sampling bugfix

Posted by "kfaraz (via GitHub)" <gi...@apache.org>.

kfaraz commented on code in PR #14356:
URL: https://github.com/apache/druid/pull/14356#discussion_r1211042955


##########
docs/development/extensions-core/kinesis-ingestion.md:
##########
@@ -284,7 +284,7 @@ The `tuningConfig` is optional. If no `tuningConfig` is specified, default param
 |`indexSpecForIntermediatePersists`|Object|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values.| no (default = same as `indexSpec`)|
 |`reportParseExceptions`|Boolean|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|no (default == false)|
 |`handoffConditionTimeout`|Long| Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.| no (default == 0)|
-|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|
+|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kinesis messages that are no longer available.<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../api-reference/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being *DROPPED* (if `useEarliestSequenceNumber` is false) or *DUPLICATED* (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful f
 or non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.|no (default == false)|

Review Comment:
   Formatting tools can be tricky as every contributor might be using their own preferred one, unless there is one recommended as a Druid standard. There is only so much consistency we can (or would want to) enforce as after all, every developer has their own style. Think of it is as using a classic `for (i = 0; ++i)` loop vs a `for each` loop vs the `forEach` operator from the Streams API. In some cases, they might improve readability/object creation but in others, they really make no difference.
   
   I am okay with the change to the list item marker as it is fairly small. But this description of `resetOffsetAutomatically` is rather complicated and I would prefer we touch it only when needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org