Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/08/04 21:40:06 UTC

[GitHub] [druid] gianm opened a new pull request #11550: SQL: Add is_active to sys.segments, update examples and docs.

gianm opened a new pull request #11550:
URL: https://github.com/apache/druid/pull/11550


   is_active is short for:
   
   ```
   (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1
   ```
   
   It's important because this represents "all the segments that should
   be queryable, whether or not they actually are right now". Most of the
   time, this is the set of segments that people will want to look at.
   
   The web console already adds this filter to many of its queries,
   which suggests it is useful.
   
   This patch also reworks the caveat at the bottom of the sys.segments
   section, so its information is mixed into the description of each result
   field. This should make it more likely for people to see the information.
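
As a minimal sketch of the equivalence described above, the following Python snippet evaluates the expression over a few made-up rows in an in-memory SQLite table. The table name `segments`, the sample segment IDs, and the helper `active_segment_ids` are assumptions for illustration only; nothing here queries Druid's `sys.segments`.

```python
import sqlite3

# The expression from the PR description that is_active abbreviates.
IS_ACTIVE_EXPR = "(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1"

# Hypothetical sample rows:
# (segment_id, is_published, is_overshadowed, is_realtime)
SAMPLE_ROWS = [
    ("published_current", 1, 0, 0),
    ("published_overshadowed", 1, 1, 0),
    ("realtime_unpublished", 0, 0, 1),
    ("unpublished_not_realtime", 0, 0, 0),
]


def active_segment_ids(rows):
    """Return the IDs of rows matching the is_active expression, sorted by ID."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE segments ("
            "segment_id TEXT, is_published INTEGER, "
            "is_overshadowed INTEGER, is_realtime INTEGER)"
        )
        conn.executemany("INSERT INTO segments VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            f"SELECT segment_id FROM segments WHERE {IS_ACTIVE_EXPR} "
            "ORDER BY segment_id"
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    print(active_segment_ids(SAMPLE_ROWS))
```

With these sample rows, only the published, non-overshadowed segment and the realtime segment match the filter.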


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #11550: SQL: Add is_active to sys.segments, update examples and docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #11550:
URL: https://github.com/apache/druid/pull/11550#discussion_r735042259



##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|

Review comment:
       There is a short delay between when a segment is published and when `num_rows` becomes fully accurate, because the value is fetched by querying a data server rather than read from the published segment descriptor. I updated the wording to the following, which is hopefully clearer:
   
   > Number of rows in this segment, or zero if the number of rows is not known.
   >
   > This row count is gathered by the Broker in the background. It will be zero if the Broker has not gathered a row count for this segment yet. For segments ingested from streams, the reported row count may lag behind the result of a `count(*)` query because the cached `num_rows` on the Broker may be out of date. This will settle shortly after new rows stop being written to that particular segment.
   
   (I also changed "null" to "zero" because that's what it actually is.)
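
A minimal sketch of how a consumer might account for this, assuming (as the revised wording says) that a `num_rows` of zero means the Broker has not gathered a count yet. The function name `summarize_row_counts` and the sample values are hypothetical, and the sketch cannot tell an unknown count apart from a segment that is genuinely empty.

```python
def summarize_row_counts(num_rows_values):
    """Sum the known row counts and count the segments still reporting zero.

    Assumption: zero means "not gathered by the Broker yet", per the
    revised num_rows description.
    """
    known_total = sum(n for n in num_rows_values if n > 0)
    unknown_segments = sum(1 for n in num_rows_values if n == 0)
    return known_total, unknown_segments


if __name__ == "__main__":
    # Hypothetical num_rows values for four segments.
    total, unknown = summarize_row_counts([100, 0, 250, 0])
    print(f"known rows: {total}, segments without a count yet: {unknown}")
```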

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|

Review comment:
       The context for "should be" is that everything related to ingestion and segment availability happens asynchronously in the background. Some segments should be available but aren't right now, and the system will work to make them available. Others are available but shouldn't be (because they were dropped or replaced), and the system will work to make them unavailable.
   
   I changed the wording, which is hopefully clearer:
   
   > True for segments that represent the latest state of a datasource.
   >
   > Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`. In steady state, when no ingestions or data management operations are happening, `is_active` will be equivalent to `is_available`. However, they may differ from each other when ingestions or data management operations have executed recently. In these cases, Druid will load and unload segments appropriately to bring actual availability in line with the expected state given by `is_active`.
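
The two transitional cases described in this comment can be sketched as a small classifier over the `is_active` and `is_available` flags. The function name and the labels `pending_load`, `pending_unload`, and `steady` are hypothetical names chosen for illustration, not Druid terminology.

```python
def classify_segment(is_active, is_available):
    """Compare expected state (is_active) with actual state (is_available).

    Both arguments use the table's convention: 1 = true, 0 = false.
    """
    if is_active == 1 and is_available == 0:
        # Should be queryable but is not being served yet.
        return "pending_load"
    if is_active == 0 and is_available == 1:
        # Still being served but was dropped or replaced.
        return "pending_unload"
    # Expected and actual state agree.
    return "steady"


if __name__ == "__main__":
    for flags in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        print(flags, classify_segment(*flags))
```

In steady state every segment would fall into the last branch, matching the statement that `is_active` and `is_available` are then equivalent.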

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
+|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Review comment:
       Yes.

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
+|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_available|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_realtime|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
+|is_overshadowed|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always 0 for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Review comment:
       Thanks, fixed.






[GitHub] [druid] paul-rogers commented on a change in pull request #11550: SQL: Add is_active to sys.segments, update examples and docs.

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #11550:
URL: https://github.com/apache/druid/pull/11550#discussion_r682987219



##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|

Review comment:
       This is the second (or third) place in the docs that emphasizes *should*. Is this notion explained anywhere? Does it mean that the segment is scheduled to load onto a Historical but has not yet done so? Or does it mean there is some kind of problem that the user must resolve?

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
+|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Review comment:
       Presumably "published to the metadata store" means "by the MiddleManager at the completion of ingestion"?

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|

Review comment:
       Could we change the wording a bit? The key point for a user seems to be: for a published segment, the number will be either null or accurate. If it is null, the Broker has not received the row count yet. For an unpublished segment, the number will be slightly out of date as new data arrives. (Assuming this is an accurate statement.)

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
+|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_available|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_realtime|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
+|is_overshadowed|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always 0 for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Review comment:
       Nit: consistent use of code font: `is_overshadowed`.






[GitHub] [druid] techdocsmith commented on pull request #11550: SQL: Add is_active to sys.segments, update examples and docs.

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on pull request #11550:
URL: https://github.com/apache/druid/pull/11550#issuecomment-1075785091


   @vtlim, I think this might have merge conflicts due to the SQL refactor. Is there any way we can get @gianm's updates into the current structure?




+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and queryable. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|

Review comment:
       This is the second (or third) place in the docs that emphasizes *should*. Is this notion explained anywhere? Does it mean that the segment is scheduled to load onto a Historical but has not yet done so? Or does it mean there is some kind of problem that the user must resolve?
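
One way to read the equivalence quoted in the `is_active` row is the minimal Python sketch below. It is not Druid code: it only evaluates `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1` for a few flag combinations. The function name `is_active` and the sample cases are assumptions made for illustration.

```python
def is_active(is_published: int, is_overshadowed: int, is_realtime: int) -> int:
    """Evaluate the expression from the docs row, using 1 = true, 0 = false."""
    active = (is_published == 1 and is_overshadowed == 0) or is_realtime == 1
    return 1 if active else 0


# Hypothetical flag combinations: (is_published, is_overshadowed, is_realtime).
cases = {
    "published, not overshadowed": (1, 0, 0),
    "published, overshadowed": (1, 1, 0),
    "realtime only, unpublished": (0, 0, 1),
    "unpublished, not realtime": (0, 0, 0),
}

for label, flags in cases.items():
    print(f"{label}: is_active = {is_active(*flags)}")
```

Note that `is_available` does not appear in the expression, so a published, non-overshadowed segment counts as active whether or not any process is serving it yet.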

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and queryable. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
+|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Review comment:
       Presumably "published to the metadata store" means "by the MiddleManager at the completion of ingestion"?

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|

Review comment:
       Change the wording a bit? The key points for a user seem to be: for a published segment, the number will be either null or accurate; if it is null, the Broker has not received the row count yet. For an unpublished segment, the number will be slightly out of date as new data arrives. (Assuming this is an accurate statement.)

##########
File path: docs/querying/sql.md
##########
@@ -1123,20 +1123,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
+|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and queryable. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
+|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_available|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_realtime|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
+|is_overshadowed|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always 0 for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|

Review comment:
       Nit: consistent use of code font: `is_overshadowed`.






[GitHub] [druid] vtlim commented on a change in pull request #11550: SQL: Add is_active to sys.segments, update examples and docs.

Posted by GitBox <gi...@apache.org>.
vtlim commented on a change in pull request #11550:
URL: https://github.com/apache/druid/pull/11550#discussion_r774106454



##########
File path: docs/querying/sql.md
##########
@@ -1193,20 +1193,23 @@ Segments table provides details on all Druid segments, whether they are publishe
 |version|STRING|Version string (generally an ISO8601 timestamp corresponding to when the segment set was first started). Higher version means the more recently created segment. Version comparing is based on string comparison.|
 |partition_num|LONG|Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous)|
 |num_replicas|LONG|Number of replicas of this segment currently being served|
-|num_rows|LONG|Number of rows in current segment, this value could be null if unknown to Broker at query time|
-|is_published|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
-|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
-|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
+|num_rows|LONG|Number of rows in this segment, or zero if the number of rows is not known.<br /><br />This row count is gathered by the Broker in the background. It will be zero if the Broker has not gathered a row count for this segment yet. For segments ingested from streams, the reported row count may lag behind the result of a `count(*)` query because the cached `num_rows` on the Broker may be out of date. This will settle shortly after new rows stop being written to that particular segment.|
+|is_active|LONG|True for segments that represent the latest state of a datasource.<br /><br />Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`. In steady state, when no ingestion or data management operations are happening, `is_active` will be equivalent to `is_available`. However, they may differ from each other when ingestion or data management operations have executed recently. In these cases, Druid will load and unload segments appropriately to bring actual availability in line with the expected state given by `is_active`.|
+|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment has been published to the metadata store and is marked as used. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_available|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any data serving process, like a Historical or a realtime ingestion task. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
+|is_realtime|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any Historical process is serving this segment.|
+|is_overshadowed|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, `is_overshadowed` is always 0 for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
 |shard_spec|STRING|JSON-serialized form of the segment `ShardSpec`|
 |dimensions|STRING|JSON-serialized form of the segment dimensions|
 |metrics|STRING|JSON-serialized form of the segment metrics|
 |last_compaction_state|STRING|JSON-serialized form of the compaction task's config (compaction task which created this segment). May be null if segment was not created by compaction task.|
 
-For example to retrieve all segments for datasource "wikipedia", use the query:
+For example to retrieve all currently-active segments for datasource "wikipedia", use the query:

Review comment:
       ```suggestion
   For example to retrieve all currently active segments for datasource "wikipedia", use the query:
   ```
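
A sketch of how such a filter behaves, using Python's built-in `sqlite3` as a stand-in rather than Druid itself. The table `segments`, its sample rows, and the helper `active_segment_ids` are hypothetical. The `WHERE` clause spells out the expression that the PR description says `is_active` is short for.

```python
import sqlite3


def active_segment_ids(conn: sqlite3.Connection, datasource: str) -> list:
    """Return ids of segments matching the expanded is_active expression."""
    query = """
        SELECT segment_id
        FROM segments
        WHERE datasource = ?
          AND ((is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1)
        ORDER BY segment_id
    """
    return [row[0] for row in conn.execute(query, (datasource,))]


# In-memory mock table with only the columns the filter needs (an assumption,
# not the real sys.segments schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments ("
    "segment_id TEXT, datasource TEXT, "
    "is_published INTEGER, is_overshadowed INTEGER, is_realtime INTEGER)"
)
conn.executemany(
    "INSERT INTO segments VALUES (?, ?, ?, ?, ?)",
    [
        ("seg_a", "wikipedia", 1, 0, 0),  # published and not overshadowed
        ("seg_b", "wikipedia", 1, 1, 0),  # published but overshadowed
        ("seg_c", "wikipedia", 0, 0, 1),  # served only by a realtime task
        ("seg_d", "other", 1, 0, 0),      # belongs to a different datasource
    ],
)

print(active_segment_ids(conn, "wikipedia"))
```

Against a Druid version that includes this change, the same selection would presumably be written as a filter on `is_active = 1` in a query over `sys.segments`.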



