Posted to dev@druid.apache.org by GitBox <gi...@apache.org> on 2018/07/09 20:47:32 UTC

[GitHub] gianm opened a new issue #5989: SQL: System tables for introspection

URL: https://github.com/apache/incubator-druid/issues/5989
 
 
   It is common among DBMSes to expose system information through special system tables. This practice helps people understand the system better and provides visibility into what's going on. Druid already has a couple; see: http://druid.io/docs/latest/querying/sql.html#retrieving-metadata.
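   For example, the existing metadata tables described in the linked docs can be queried with standard Druid SQL (the datasource name `wikipedia` below is just an illustration):
   
   ```sql
   -- List all tables visible to Druid SQL
   SELECT TABLE_SCHEMA, TABLE_NAME
   FROM INFORMATION_SCHEMA.TABLES;
   
   -- Inspect the columns of a particular datasource
   SELECT COLUMN_NAME, DATA_TYPE
   FROM INFORMATION_SCHEMA.COLUMNS
   WHERE TABLE_NAME = 'wikipedia';
   ```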
   
   In addition to INFORMATION_SCHEMA, I propose adding a "system" schema with:
   
   - `system.segments` -- a row per segment. It should include both segments from the metadata store _and_ segments that are served, but not yet published (i.e. segments that are served from realtime tasks). It should include the DataSegment payload as one of its columns, plus some other useful columns like: dataSource, start, end, is_published, is_available, is_realtime.
   - `system.servers` -- a row per server. It should include host, port, and server type. I think at first it would be ok to only include data servers here (i.e. historicals and peons). Eventually we would want to include all server types.
   - `system.server_segments` -- a row per segment per server. The idea is that this table can be used to join `segments` with `servers`.
   - `system.tasks` -- a row per task.
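   To sketch how these would fit together (the join-key and column names here, like `server`, `segment_id`, and `datasource`, are illustrative, not a final schema):
   
   ```sql
   -- Which data servers hold segments of a given datasource, and how many?
   SELECT servers.host, servers.port, COUNT(*) AS num_segments
   FROM system.server_segments AS ss
   JOIN system.servers AS servers ON ss.server = servers.server
   JOIN system.segments AS segments ON ss.segment_id = segments.segment_id
   WHERE segments.datasource = 'wikipedia'
   GROUP BY servers.host, servers.port;
   ```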
   
   The segment-related tables would largely be meant as more user-friendly alternatives to (and eventual replacements for) the hodgepodge of REST APIs on http://druid.io/docs/latest/design/coordinator.html. The `system.tasks` table would be an alternative to the task APIs on the overlord.
   
   For these tables to work, the information needs to be available on the broker. Some of this data it already has pre-cached (for example: served segments for `system.segments` are in one of its ServerViews, as are all the data servers for `system.servers`, and all the server-segment mappings for `system.server_segments`). Some of this data the broker does _not_ know today (for example: metadata segments for `system.segments` are only known to the coordinator, and tasks for `system.tasks` are only known to the overlord).
   
   For the data that the broker already has, the system tables should be wired up to it, similar to how INFORMATION_SCHEMA works.
   
   For the data that the broker does _not_ have, we could either pull it in and pre-cache it, or we could pull it on demand. For metadata segments I could see this going either way. For tasks I think we should not pull and cache -- it changes frequently and we don't want the data to be stale (users will likely rely on this table to be up-to-the-second). So for tasks we should pull "live" from the overlord on each query. There isn't an example of this being done today, but it's something Calcite is capable of. We should also endeavor to push down filters (like dataSource, task type) where possible.
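   As a sketch of what filter pushdown might look like from the user's side (column names and the task type value below are illustrative):
   
   ```sql
   -- Both predicates are candidates for pushdown: the broker could forward them
   -- to the overlord's task API rather than fetching and filtering the full
   -- task list itself.
   SELECT task_id, type, datasource, status
   FROM system.tasks
   WHERE datasource = 'wikipedia'
     AND type = 'index_kafka';
   ```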
