Posted to commits@druid.apache.org by "LakshSingla (via GitHub)" <gi...@apache.org> on 2023/03/02 06:57:34 UTC

[GitHub] [druid] LakshSingla commented on a diff in pull request #13686: Integrate the catalog with the Calcite planner

LakshSingla commented on code in PR #13686:
URL: https://github.com/apache/druid/pull/13686#discussion_r1092794902


##########
sql/src/main/codegen/includes/insert.ftl:
##########
@@ -21,13 +21,14 @@
 SqlNode DruidSqlInsertEof() :
 {
   SqlNode insertNode;
-  org.apache.druid.java.util.common.Pair<Granularity, String> partitionedBy = new org.apache.druid.java.util.common.Pair(null, null);
+  SqlNode partitionedBy = null;
   SqlNodeList clusteredBy = null;
 }
 {
   insertNode = SqlInsert()
-  // PARTITIONED BY is necessary, but is kept optional in the grammar. It is asserted that it is not missing in the
-  // DruidSqlInsert constructor so that we can return a custom error message.
+  // PARTITIONED BY is necessary. It can be provided either in this statement or in the catalog.

Review Comment:
   Does this mean that queries of the following type are allowed as a part of this change?
   ```
   INSERT INTO table
   SELECT * FROM TABLE(EXTERN(...))
   CLUSTERED BY row1, row2
   ```



##########
docs/catalog/api.md:
##########
@@ -0,0 +1,261 @@
+---
+id: intro
+title: Table Metadata Catalog
+sidebar_label: Intro
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+The table metadata feature adds a number of new REST APIs to create, read, update and delete
+(CRUD) table metadata entries. Table metadata is a set of "hints" on top of physical datasources, and
+a set of convenience entries for external tables. Changing the metadata "hints" has no effect
+on existing datasources or segments, or on in-flight ingestion jobs. Metadata changes only
+affect future queries and ingestion jobs.
+
+**Note**: a future enhancement might allow a single operation to, say, delete both a metadata
+entry and the actual datasource. That functionality is not yet available. In this version, each
+metadata operation affects _only_ the information in the table metadata table within Druid's
+metadata store.
+
+A metadata table entry is called a "table specification" (or "table spec") and is represented
+by a Java class called `TableSpec`. The JSON representation of this object appears below.
+
+The API calls are designed for two use cases:
+
+* Configuration-as-code, in which the application uploads a complete table spec. Like in
+Kubernetes or other deployment tools, the source of truth for specs is in a source repository,
+typically Git. The user deploys new table specs to Druid as part of rolling out an updated
+application.
+* UI-driven, in which you use the Druid Console or similar tool to edit the table metadata
+directly.
+
+The configuration-as-code use case works with entire specs. The UI use case may work with an
+entire spec, or may perform focused "edit" operations. Here it is important to remember that
+Druid is a distributed, multi-user system: it could be that multiple users edit the same
+table at the same time. The API to overwrite a table spec provides a version number to allow
+a UI to implement optimistic locking: if the current version is not the one the UI read, then
+Druid rejects the change. A set of "edit" operations perform focused chagnges that avoid the

Review Comment:
   nit:
   ```suggestion
   Druid rejects the change. A set of "edit" operations perform focused changes that avoid the
   ```



##########
docs/catalog/index.md:
##########
@@ -0,0 +1,314 @@
+---
+id: index
+title: Table Metadata Catalog
+sidebar_label: Introduction
+description: Introduces the table metadata features
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid 26.0 introduces the idea of _metadata_ to describe datasources and external
+tables. External table metadata simplifies MSQ ingestion statements. Datasource
+metadata gives you another data governance tool to manage and communicate the data
+stored in a datasource. The use of table metadata is entirely optional: you can
+use it for no tables, some tables or all tables as your needs evolve.
+
+The table metadata feature consists of the following main parts:
+
+* A new table in Druid's catalog (also known as the metadata store),
+* A set of new REST APIs to manage table metadata entries,
+* SQL integratoin to put the metadata to work.

Review Comment:
   ```suggestion
   * SQL integration to put the metadata to work.
   ```



##########
docs/catalog/api.md:
##########
@@ -0,0 +1,261 @@
+---
+id: intro
+title: Table Metadata Catalog
+sidebar_label: Intro
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+The table metadata feature adds a number of new REST APIs to create, read, update and delete
+(CRUD) table metadata entries. Table metadata is a set of "hints" on top of physical datasources, and
+a set of convenience entries for external tables. Changing the metadata "hints" has no effect
+on existing datasources or segments, or on in-flight ingestion jobs. Metadata changes only
+affect future queries and ingestion jobs.
+
+**Note**: a future enhancement might allow a single operation to, say, delete both a metadata
+entry and the actual datasource. That functionality is not yet available. In this version, each
+metadata operation affects _only_ the information in the table metadata table within Druid's
+metadata store.
+
+A metadata table entry is called a "table specification" (or "table spec") and is represented
+by a Java class called `TableSpec`. The JSON representation of this object appears below.
+
+The API calls are designed for two use cases:
+
+* Configuration-as-code, in which the application uploads a complete table spec. Like in
+Kubernetes or other deployment tools, the source of truth for specs is in a source repository,
+typically Git. The user deploys new table specs to Druid as part of rolling out an updated
+application.
+* UI-driven, in which you use the Druid Console or similar tool to edit the table metadata
+directly.
+
+The configuration-as-code use case works with entire specs. The UI use case may work with an
+entire spec, or may perform focused "edit" operations. Here it is important to remember that
+Druid is a distributed, multi-user system: it could be that multiple users edit the same
+table at the same time. The API to overwrite a table spec provides a version number to allow
+a UI to implement optimistic locking: if the current version is not the one the UI read, then
+Druid rejects the change. A set of "edit" operations perform focused chagnges that avoid the
+need for optimistic locking: Druid applies the change rather than the UI.
+
+All table metadata APIs begin with a common prefix: `/druid/coordinator/v1/catalog`.
+
+Tables are identified by a path: the schema and table name. Druid has a fixed, pre-defined set
+of schemas:
+
+* `druid`: Datasources
+* `ext`: External tables
+
+There are others, but those do not accept table metadata entries.
+
+## Configuration-as-Code
+
+CRUD for an entire `TableSpec`, optionally protected by a version.
+
+### `POST {prefix}/schemas/{schema}/tables/{table}[?version={n}|overwrite=true|false]`
+
+Create or update a `TableSpec`.
+
+* With no options, the semantics are "create if not exists."
+* With a version, the semantics are "update if exists and is at the given version"
+* With "overwrite", the semantics are "create or update"
+
+Use `overwrite=true` when some other system is the source of truth: you want to force
+the table metadata to match. Otherwise, first read the existing metadata, which provides
+the data version. Apply changes and set `version={n}` to the version retrieved. Druid will
+reject the chagne it other changes have occured since the data was read. The client should

Review Comment:
   nit:
   ```suggestion
   reject the change it other changes have occurred since the data was read. The client should
   ```



##########
docs/catalog/api.md:
##########
@@ -0,0 +1,261 @@
+---
+id: intro
+title: Table Metadata Catalog
+sidebar_label: Intro
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+The table metadata feature adds a number of new REST APIs to create, read, update and delete
+(CRUD) table metadata entries. Table metadata is a set of "hints" on top of physical datasources, and
+a set of convenience entries for external tables. Changing the metadata "hints" has no effect
+on existing datasources or segments, or on in-flight ingestion jobs. Metadata changes only
+affect future queries and ingestion jobs.
+
+**Note**: a future enhancement might allow a single operation to, say, delete both a metadata
+entry and the actual datasource. That functionality is not yet available. In this version, each
+metadata operation affects _only_ the information in the table metadata table within Druid's
+metadata store.
+
+A metadata table entry is called a "table specification" (or "table spec") and is represented
+by a Java class called `TableSpec`. The JSON representation of this object appears below.
+
+The API calls are designed for two use cases:
+
+* Configuration-as-code, in which the application uploads a complete table spec. Like in
+Kubernetes or other deployment tools, the source of truth for specs is in a source repository,
+typically Git. The user deploys new table specs to Druid as part of rolling out an updated
+application.
+* UI-driven, in which you use the Druid Console or similar tool to edit the table metadata
+directly.
+
+The configuration-as-code use case works with entire specs. The UI use case may work with an
+entire spec, or may perform focused "edit" operations. Here it is important to remember that
+Druid is a distributed, multi-user system: it could be that multiple users edit the same
+table at the same time. The API to overwrite a table spec provides a version number to allow
+a UI to implement optimistic locking: if the current version is not the one the UI read, then
+Druid rejects the change. A set of "edit" operations perform focused chagnges that avoid the
+need for optimistic locking: Druid applies the change rather than the UI.
+
+All table metadata APIs begin with a common prefix: `/druid/coordinator/v1/catalog`.
+
+Tables are identified by a path: the schema and table name. Druid has a fixed, pre-defined set
+of schemas:
+
+* `druid`: Datasources
+* `ext`: External tables
+
+There are others, but those do not accept table metadata entries.
+
+## Configuration-as-Code
+
+CRUD for an entire `TableSpec`, optionally protected by a version.
+
+### `POST {prefix}/schemas/{schema}/tables/{table}[?version={n}|overwrite=true|false]`
+
+Create or update a `TableSpec`.
+
+* With no options, the semantics are "create if not exists."
+* With a version, the semantics are "update if exists and is at the given version"
+* With "overwrite", the semantics are "create or update"
+
+Use `overwrite=true` when some other system is the source of truth: you want to force
+the table metadata to match. Otherwise, first read the existing metadata, which provides
+the data version. Apply changes and set `version={n}` to the version retrieved. Druid will
+reject the chagne it other changes have occured since the data was read. The client should
+read the new version, reapply the changes, and submit the change again.
+
+### `DELETE {prefix}/schemas/{schema}/tables/{table}`
+
+Removes the metadata for the given table. Note that this API _does not_ affect the datasource itself
+or any segments for the datasource. Deleting a table metadata entry simply says that you no longer
+wish to use the data governance features for that datasource.
+
+## Editing
+
+Edit operations apply a specific transform to an existing table spec. Because the transform is very
+specific, there is no need for a version as there is little scope for unintentional overwrites. The
+edit actions are specifically designed to be used by UI tools such as the Druid Console.
+
+### `POST {prefix}/schemas/{schema}/tables/{table}/edit`
+
+The payload is a message that says what to change:
+
+* Hide columns (add to the hidden columns list)
+* Unhide columns (remove from the hidden columns list)
+* Drop columns (remove items from the columns list)
+* Move columns
+* Update props (merge updated properties with existing)
+* Update columns (merge updated columns with existing)
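+
+As an illustration only, a "hide columns" edit request might look like the sketch below.
+The payload field names used here (`type`, `columns`) are assumptions for illustration and
+are not confirmed by this document; consult the extension code for the authoritative format.
+
+```json
+{
+  "type": "hideColumns",
+  "columns": ["bytesIn"]
+}
+```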
+
+## Retrieval
+
+Finally, the API provides a variety of ways to pull information from the table metadata
+storage. Again, the information provided here is _only_ the metadata hints. Query the
+Druid system tables to get the "effective" schema for all Druid datasources. The effective
+schema includes not just the tables and columns defined in metadata, but any additional
+items that physically exist in segments.
+
+### `GET {prefix}/schemas[?format=name|path|metadata]`
+
+Returns one of:
+
+* `name`: The list of schema names (Default)
+* `path`: The list of all table paths (i.e. `{schema}.{table}` pairs)
+* `metadata`: A list of metadata for all tables in all schemas.
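+
+As a purely illustrative sketch (the table names are hypothetical and the exact response
+shape is an assumption), the `path` format would return entries such as:
+
+```json
+["druid.webMetrics", "ext.webInput"]
+```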
+
+### `GET {prefix}/schemas/{schema}`
+
+Not supported. Reserved to obtain information about the schema itself.
+
+### `GET {prefix}/schemas/{schema}/tables[?format=name|metadata]`
+
+Returns one of:
+
+* `name`: The list of table names within the schema.
+* `metadata`: The list of `TableMetadata` for each table within the schema.
+
+### `GET {prefix}/schemas/{schema}/tables/{table}[?format=spec|metadata|status]`
+
+Retrieves information for one table. Formats are:
+
+* `spec`: The `TableSpec` (default)
+* `metadata`: the `TableMetadata`
+* `status`: the `TableMetadata` without the spec (that is, the name, state, creation date and update date)
+
+### Synchronization
+
+Druid uses two additional APIs internally to synchronize table metadata. The Coordinator
+stores the metadata, and services the APIs described above. The Broker uses the metadata
+to plan queries. These synchronization APIs keep the Broker in sync with the Coordinator.
+You should never need to use these (unless you are also creating a synchronization solution),
+but you may want to know about them to understand traffic between Druid servers.
+
+### `GET /sync/tables`
+
+Returns a list of `TableSpec` objects for bootstrapping a Broker.
+
+Similar to `GET /entry/tables`, but that endpoint returns metadata.
+
+### `POST /sync/delta`
+
+On the broker, receives update (delta) messages from the coordinator. The payload is an object:
+
+* Create/update: table path plus the `TableSpec`
+* Delete: table path only
+
+## Table Specification
+
+You submit a table spec when creating or updating a table metadata entry.
+The table spec object consists of the following parts:
+
+* `type`: the kind of table described by the spec
+* `properties`: the set of properties for the table (see below)
+* `columns`: the set of columns for the table.
+
+Example:
+
+```json
+"spec": {
+  "type": "<type>",
+  "properties": { ... },
+  "columns": [ ... ]
+}
+```
+
+### Datasource Table Spec
+
+The syntax of a datasource table spec follows the general pattern. A datasource is
+described by a set of properties, as defined (LINK NEEDED). The type of a datasource
+table is `datasource`. Columns are of type `column`.
+
+Example:
+
+```json
+{
+   "type":"datasource",
+   "properties": {
+     "description": "<text>",
+     "segmentGranularity": "<period>",
+     "targetSegmentRows": <number>,
+     "clusterKeys": [ { "column": "<name">, "desc": true|false } ... ],
+     "hiddenColumns: [ "<name>" ... ]
+   },
+   "columns": [ ... ],
+}
+```
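+
+As a more concrete sketch, based on the `webMetrics` walkthrough in the
+[introduction](./index.md), a datasource spec might look roughly like the following.
+The per-column field names (`name`, `sqlType`) are assumptions for illustration, not a
+confirmed format:
+
+```json
+{
+  "type": "datasource",
+  "properties": {
+    "description": "Web server metrics",
+    "segmentGranularity": "P1D",
+    "clusterKeys": [ { "column": "host", "desc": false } ]
+  },
+  "columns": [
+    { "type": "column", "name": "__time", "sqlType": "TIMESTAMP('PT1S')" },
+    { "type": "column", "name": "host", "sqlType": "VARCHAR" },
+    { "type": "column", "name": "httpStatus", "sqlType": "VARCHAR" },
+    { "type": "column", "name": "bytesIn", "sqlType": "SUM(BIGINT)" },
+    { "type": "column", "name": "bytesOut", "sqlType": "SUM(BIGINT)" }
+  ]
+}
+```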
+
+### External Table Spec
+
+An external table spec has the same basic structure. Most external tables have just
+two properties:
+
+* `source`: The JSON-serialized form of a Druid input source.
+* `format`: The JSON-serialized form of a Druid input format.
+
+These two properties directly correspond to the JSON you would use with the MSQ
+`extern` statement. (In the catalog, the third argument, signature, is computed
+from the set of columns in the table spec.)
+
+See (LINK NEEDED) for details on the available input sources and formats, and on
+how to define a partial table completed with additional information at ingest time.
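+
+As a sketch only, an external table spec for a local CSV directory might look like the
+example below. The `source` and `format` values are JSON-serialized Druid input source and
+input format objects; the spec's `type` placeholder and the column field names are
+assumptions for illustration:
+
+```json
+{
+  "type": "<type>",
+  "properties": {
+    "source": "{\"type\": \"local\", \"baseDir\": \"/var/metrics/web\", \"filter\": \"*.csv\"}",
+    "format": "{\"type\": \"csv\", \"findColumnsFromHeader\": true}"
+  },
+  "columns": [
+    { "type": "column", "name": "timestamp", "sqlType": "VARCHAR" },
+    { "type": "column", "name": "host", "sqlType": "VARCHAR" }
+  ]
+}
+```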
+
+### Table Metadata
+
+Druid returns a table metadata entry when you retrieve data for one or more tables.
+Each instance includes the table spec plus additional "meta-metadata" about the entry.
+
+* `id`: a map of two fields that gives the table path.
+  * `schema`: one of the Druid-supported schemas described above.
+  * `name`: table name
+* `creationTime`: UTC Timestamp (in milliseconds since the epoch) of the original creation
+of the metadata entry (not of the actual datasource).
+* `updateTime`: UTC Timestamp (in milliseconds since the epoch) of the most recent creation
+or update. This is the record's "version" when using optimistic locking. Pass this value
+as the `version={updateTime}` query parameter when performing an update with optimistic locking.
+* `state`: For datasources. Normally ACTIVE. May be DELETING if datasource deletion is in

Review Comment:
   Most of the metadata entries are "meta-metadata", while this one is for the datasource itself. Would it make more sense to use a field like `datasourceState` or `tableState` to make that explicit to the user?



##########
docs/catalog/index.md:
##########
@@ -0,0 +1,314 @@
+---
+id: index
+title: Table Metadata Catalog
+sidebar_label: Introduction
+description: Introduces the table metadata features
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid 26.0 introduces the idea of _metadata_ to describe datasources and external
+tables. External table metadata simplifies MSQ ingestion statements. Datasource
+metadata gives you another data governance tool to manage and communicate the data
+stored in a datasource. The use of table metadata is entirely optional: you can
+use it for no tables, some tables or all tables as your needs evolve.
+
+The table metadata feature consists of the following main parts:
+
+* A new table in Druid's catalog (also known as the metadata store),
+* A set of new REST APIs to manage table metadata entries,
+* SQL integratoin to put the metadata to work.
+
+Table metadata is an experimental feature in this release. We encourage you to try
+it and to provide feedback. However, since the feature is experimental, some parts
+may change based on feedback.
+
+In this section, we use the term "table" to refer to both datasources and external
+tables. To SQL, these are both tables, and the table metadata feature treats them
+similarly.
+
+## Contents
+
+- [Introduction](./index.md) - Overview of the table metadata feature.
+- [Metadata API](./api.md) - Reference for the REST API operations to create and modify metadata.
+- [Datasource Metadata](./datasource.md) - Details for datasource metadata and partial tables.
+- [External Tables](./ext.md) - Details about external tables used in MSQ ingestion.
+
+
+## Enable Table Metadata Storage
+
+Table metadata storage resides in an extension which is not loaded by default. Instead,
+you must explicitly "opt in" to using this feature. Table metadata storage is required
+for the other features described in this section.
+
+Enable table metadata storage by adding `druid-catalog` to your list of loaded
+extensions in `_common/common.runtime.properties`:
+
+```text
+druid.extensions.loadList=[..., "druid-catalog"]
+```
+
+Enabling this extension causes Druid to create the table metadata table and to
+make the table metadata REST APIs available for use.
+
+## When to Use Table Metadata
+
+People use Druid in many ways. When learning Druid, we often load data just to
+see how Druid works and to experiment. For this, "classic" Druid is ideal: just
+load data and go: there is no need for additional datasource metadata.
+
+As we put Druid to work, we craft an application around Druid: a data pipeline
+that feeds data into Druid, UI (dashboards, applications) which pull data out,
+and deployment scripts to keep the system running. At this point, it can be
+useful to design the table schema with as much care as the rest of the system.
+What columns are needed? What column names will be the most clear to the end
+user? What data type should each column have?
+
+Historically, developers encoded these decisions into Druid ingestion specs.
+With MSQ, that information is encoded into SQL statements. To control changes,

Review Comment:
   nit: We should either expand MSQ when mentioning it for the first time or refer to it by the extension name since it's not a part of core Druid.



##########
docs/catalog/index.md:
##########
@@ -0,0 +1,314 @@
+---
+id: index
+title: Table Metadata Catalog
+sidebar_label: Introduction
+description: Introduces the table metadata features
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid 26.0 introduces the idea of _metadata_ to describe datasources and external
+tables. External table metadata simplifies MSQ ingestion statements. Datasource
+metadata gives you another data governance tool to manage and communicate the data
+stored in a datasource. The use of table metadata is entirely optional: you can
+use it for no tables, some tables or all tables as your needs evolve.
+
+The table metadata feature consists of the following main parts:
+
+* A new table in Druid's catalog (also known as the metadata store),
+* A set of new REST APIs to manage table metadata entries,
+* SQL integratoin to put the metadata to work.
+
+Table metadata is an experimental feature in this release. We encourage you to try
+it and to provide feedback. However, since the feature is experimental, some parts
+may change based on feedback.
+
+In this section, we use the term "table" to refer to both datasources and external
+tables. To SQL, these are both tables, and the table metadata feature treats them
+similarly.
+
+## Contents
+
+- [Introduction](./index.md) - Overview of the table metadata feature.
+- [Metadata API](./api.md) - Reference for the REST API operations to create and modify metadata.
+- [Datasource Metadata](./datasource.md) - Details for datasource metadata and partial tables.
+- [External Tables](./ext.md) - Details about external tables used in MSQ ingestion.
+
+
+## Enable Table Metadata Storage
+
+Table metadata storage resides in an extension which is not loaded by default. Instead,
+you must explicitly "opt in" to using this feature. Table metadata storage is required
+for the other features described in this section.
+
+Enable table metadata storage by adding `druid-catalog` to your list of loaded
+extensions in `_common/common.runtime.properties`:
+
+```text
+druid.extensions.loadList=[..., "druid-catalog"]
+```
+
+Enabling this extension causes Druid to create the table metadata table and to
+make the table metadata REST APIs available for use.
+
+## When to Use Table Metadata
+
+People use Druid in many ways. When learning Druid, we often load data just to
+see how Druid works and to experiment. For this, "classic" Druid is ideal: just
+load data and go: there is no need for additional datasource metadata.
+
+As we put Druid to work, we craft an application around Druid: a data pipeline
+that feeds data into Druid, UI (dashboards, applications) which pull data out,
+and deployment scripts to keep the system running. At this point, it can be
+useful to design the table schema with as much care as the rest of the system.
+What columns are needed? What column names will be the most clear to the end
+user? What data type should each column have?
+
+Historically, developers encoded these decisions into Druid ingestion specs.
+With MSQ, that information is encoded into SQL statements. To control changes,
+the specs or SQL statements are versioned in Git, subject to reviews when changes
+occur.
+
+This approach works well when there is a single source of data for each datasource.
+It can become difficult to manage when we want to load multiple, different inputs
+into a single datasource, and we want to create a single, composite schema that
+represents all of them. Think, for example, of a metrics application in which
+data arrives from various metrics feeds such as logs and so on. We want to decide
+upon a common set of columns so that, say HTTP status is always stored as a string
+in a column called "http_status", even if some of the inputs use names such as
+"status", "httpCode", or "statusCode". This is, in fact, the classic [ETL
+(extract, transform and load)](https://en.wikipedia.org/wiki/Extract,_transform,_load)
+design challenge.
+
+In this case, having a known, defined datasource schema becomes essential. The job
+of each of many ingestion specs or MSQ queries is to transform the incoming data from
+the format in which it appears into the form we've designed for our datasource.
+While table metadata is generally useful, it is this many-to-one mapping scenario
+where it finds its greatest need.
+
+### Table Metadata is Optional
+
+Table metadata is a new feature. Existing Druid systems have physical datasource
+metadata, but no logical metadata. As a result, table metadata can be thought of as a set
+of "hints": optional information that, if present, provides additional features not
+available when only using the existing physical metadata. See the
+[datasource section](./datasource.md) for information on the use of metadata with
+datasources. Without external metadata, you can use the existing `extern()` table
+function to specify an MSQ input source, format and schema.
+
+### Table Metadata is Extensible
+
+This documentation explains the Druid-defined properties available for datasources
+and external tables. The metadata system is extensible: you can define other
+properties, as long as those properties use names distinct from the Druid-defined
+names. For example, you might want to track the source of the data, or might want
+to include display name or display format properties to help your UI display
+columns.
+
+## Overview of Table Metadata
+
+Here is a typical scenario to explain how table metadata works. Let's say we want
+to create a new datasource. (Although you can use table metadata with an existing
+datasource, this explanation will focus on a new one.) We know our input data is
+not quite what dashboard users expect, so we want to clean things up a bit.
+
+### Design the Datasource
+
+We start with pencil and paper (actually, a text editor or spreadsheet) and we
+list the input columns and their types. Next, we think about the name we want to
+appear for that column in our UI, and the data type we'd prefer. If we are going
+to use Druid's rollup feature, we also want to identify the aggregation we wish
+to use. The result might look like this:
+
+```text
+Input source: web-metrics.csv
+Datasource: webMetrics
+
+    Input Column               Datasource Column
+Name          Type        Name       Type      Aggregation
+----          ----        ----       ----      -----------
+timestamp     ISO string  __time     TIMESTAMP
+host          string      host       VARCHAR
+referer       string
+status        int         httpStatus VARCHAR
+request-size  int         bytesIn    BIGINT    SUM
+response-size int         bytesOut   BIGINT    SUM
+...
+```
+
+Here, we've decided we don't need the referer. For our metrics use, the referer
+might not be helpful, so we just ignore it.
+
+Notice that we use arbitrary types for the CSV input. CSV doesn't really have
+types. We'll need to convert these to SQL types later. For datasource columns,
+we use SQL types, not Druid types. There is a SQL type for each Druid type and
+Druid aggregation. See [Datasource metadata](./datasource.md) for details.
+
+For aggregates, we use a set of special aggregate types described [here](./datasource.md).
+In our case, we use `SUM(BIGINT)` for the aggregate types.
+
+If we are using rollup, we may want to truncate our timestamp. Druid rolls up
+to the millisecond level by default. Perhaps we want to roll up to the second
+level (that is, combine all data for a given host and status that occurs within
+any one second of time). We do so by specifying a parameter to the time data type:
+`TIMESTAMP('PT1S')`.
+
+A Druid datasource needs two additional properties: our time partitioning
+and clustering. [Time partitioning](../multi-stage-query/reference.html#partitioned-by)
+gives the size of the time chunk stored in each segment file. Perhaps our system is of
+moderate size, and we want our segments to hold a day of data, or `P1D`.
+
+We can also choose [clustering](../multi-stage-query/reference.html#clustered-by), which
+is handy as data sizes expand. Suppose we want to cluster on `host`.
+
+We have the option of "sealing" the schema, meaning that we _only_ want these columns
+to be ingested, but no others. We'll do that here. The default option is to be
+unsealed, meaning that ingestion can add new columns at will without the need to first
+declare them in the schema. Use the option that works best for your use case.
+
+We are now ready to define our datasource metadata. We can craft a JSON payload with the
+information (handy if we write a script to do the translation work.) Or, we
+can use the Druid web console. Either way, we create a metadata entry for
+our `webMetrics` datasource that contains the columns and types above.
+
+Here's an example of the JSON payload:
+
+```JSON
+TO DO
+```
+
+If using the REST API, we create the table using the (TODO) API (NEED LINK).
+
+### Define our Input Data
+
+We plan to load new data every day, just after midnight, so our ingestion frequency
+matches our time partitioning. (As our system grows, we will probably ingest more
+frequently, and perhaps shift to streaming ingestion, but let's start simple.)
+We don't want to repeat the same information on each ingest. Instead, we want to
+store that information once and reuse it.
+
+To make things simple, let's assume that Druid runs on only one server, so we can
+load from a local directory. Let's say we always put the data in `/var/metrics/web`
+and we name our directories with the date: `2022-12-02`, etc. Each directory has any
+number of files for that day. Let's further suppose that we know that our inputs are
+always CSV, and always have the columns we listed above.
+
+We model this in Druid by defining an external table that includes everything about
+the input _except_ the set of files (which change each day). Let's call our external
+table `webInput`. All external tables reside in the `ext` schema (namespace).
+
+We again use the (TODO) API, this time with a JSON payload for the external table.
+The bulk of the information is similar to what we use in an MSQ `extern` function:
+the JSON-serialized form of an input source and format. But the input source is
+"partial": we leave off the actual files.
+
+Notice we use SQL types when defining the external table. Druid uses this information
+to parse the CSV file columns into the desired SQL (and Druid) type.
+
+```JSON
+TO DO
+```
+
+### MSQ Ingest
+
+Our next step is to load some data. We'll use MSQ. We've defined our datasource, and
+the input files from which we'll read. We use MSQ to define the mapping from the
+input to the output.
+
+```sql
+INSERT INTO webMetrics
+SELECT
+  TIME_PARSE(timestamp) AS __time,
+  host,
+  CAST(status AS VARCHAR) AS httpStatus,
+  SUM("request-size") AS bytesIn,
+  SUM("response-size") AS bytesOut
+FROM ext.webInput(filePattern => '2022-12-02/*.csv')
+GROUP BY __time, host, status
+```
+
+Some things to notice:
+
+* Druid's `INSERT` statement has a special way to match columns. Unlike standard SQL,
+which matches by position, Druid SQL matches by name. Thus, we must use `AS` clauses
+to ensure our data has the same name as the column in our datasource.
+* We do not include a `TIME_FLOOR` function because Druid inserts that automatically
+based on the `TIMESTAMP('PT1S')` data type.
+* We include a `CAST` to convert `status` (a number) to `httpStatus`
+(as string). Druid will validate the type based on the table metadata.
+* We include the aggregations for the two aggregated fields.
+The aggregations must match those defined in table metadata.
+* We also must explicitly spell out the `GROUP BY` clause which must include all
+non-aggregated columns.
+* Notice that there are no `PARTITIONED BY` or `CLUSTERED BY` clauses: Druid fills those
+in from table metadata.
+
+If we make a mistake, and omit a required column alias, or use the wrong alias, the
+MSQ query will fail. This is part of the data governance idea: Druid just prevented us
+from creating segments that don't satisfy our declared table schema.
+
+### Query and Schema Evolution
+
+We can now query our data. Though it is not obvious in this example, the query experience
+is also guided by table metadata. To see this, suppose that we soon discover that no one
+actually needs the `bytesIn` field: our server doesn't accept `POST` requests and so all
+requests are pretty small. To avoid confusing users, we want to remove that column. But,
+we've already ingested data. We could throw away the data and start over. Or, we could
+simply "hide" the unwanted column using the
+[hide columns](api.md) API.
+
+If we now do a `SELECT *` query we find that `bytesIn` is no longer available and we
+no longer have to explain to our users to ignore that column. While hiding a column is
+trivial in this case, imagine if you wanted to make that decision after loading a petabyte
+of data. Hiding a column gives you an immediate solution, even if it might take a long
+time to rewrite (compact) all the existign segments to physicaly remove the column.

Review Comment:
   ```suggestion
   time to rewrite (compact) all the existing segments to physically remove the column.
   ```



##########
docs/catalog/ext.md:
##########
@@ -0,0 +1,491 @@
+---
+id: ext
+title: External Tables
+sidebar_label: External Tables
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## External Tables
+
+Druid ingests data from external sources. Druid has three main ways to perform
+ingestion:
+
+* [MSQ](../multi-stage-query/reference.md#extern) (SQL-based ingestion)
+* "Classic" [batch ingestion](../ingestion/ingestion-spec.md#ioconfig)
+* Streaming ingestion
+
+In each cases, Druid reads from an external source defined by a triple of

Review Comment:
   ```suggestion
   In each case, Druid reads from an external source defined by a triple of
   ```



##########
docs/catalog/index.md:
##########
@@ -0,0 +1,314 @@
+---
+id: index
+title: Table Metadata Catalog
+sidebar_label: Introduction
+description: Introduces the table metadata features
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid 26.0 introduces the idea of _metadata_ to describe datasources and external
+tables. External table metadata simplifies MSQ ingestion statements. Datasource
+metadata gives you another data governance tool to manage and communicate the data
+stored in a datasource. The use of table metadata is entirely optional: you can
+use it for no tables, some tables or all tables as your needs evolve.
+
+The table metadata feature consists of the following main parts:
+
+* A new table in Druid's catalog (also known as the metadata store),
+* A set of new REST APIs to manage table metadata entries,
+* SQL integratoin to put the metadata to work.
+
+Table metadata is an experimental feature in this release. We encourage you to try
+it and to provide feedback. However, since the feature is experimental, some parts
+may change based on feedback.
+
+In this section, we use the term "table" to refer to both datasources and external
+tables. To SQL, these are both tables, and the table metadata feature treats them
+similarly.
+
+## Contents
+
+- [Introduction](./index.md) - Overview of the table metadata feature.
+- [Metadata API](./api.md) - Reference for the REST API operations to create and modify metadata.
+- [Datasource Metadata](./datasource.md) - Details for datasource metadata and partial tables.
+- [External Tables](./ext.md) - Details about external tables used in MSQ ingestion.
+
+
+## Enable Table Metadata Storage
+
+Table metadata storage resides in an extension which is not loaded by default. Instead,
+you must explicitly "opt in" to using this feature. Table metadata storage is required
+for the other features described in this section.
+
+Enable table metadata storage by adding `druid-catalog` to your list of loaded
+extensions in `_common/common.runtime.properties`:
+
+```text
+druid.extensions.loadList=[..., "druid-catalog"]
+```
+
+Enabling this extension causes Druid to create the table metadata table and to
+make the table metadata REST APIs available for use.
+
+## When to Use Table Metadata
+
+People use Druid in many ways. When learning Druid, we often load data just to
+see how Druid works and to experiment. For this, "classic" Druid is ideal: just
+load data and go: there is no need for additional datasource metadata.
+
+As we put Druid to work, we craft an application around Druid: a data pipeline
+that feeds data into Druid, UI (dashboards, applications) which pull data out,
+and deployment scripts to keep the system running. At this point, it can be
+useful to design the table schema with as much care as the rest of the system.
+What columns are needed? What column names will be the most clear to the end
+user? What data type should each column have?
+
+Historically, developers encoded these decisions into Druid ingestion specs.
+With MSQ, that information is encoded into SQL statements. To control changes,
+the specs or SQL statements are versioned in Git, subject to reviews when changes
+occur.
+
+This approach works well when there is a single source of data for each datasource.
+It can become difficult to manage when we want to load multiple, different inputs
+into a single datasource, and we want to create a single, composite schema that
+represents all of them. Think, for example, of a metrics application in which
+data arrives from various metrics feeds such as logs and so on. We want to decide
+upon a common set of columns so that, say HTTP status is always stored as a string
+in a column called "http_status", even if some of the inputs use names such as
+"status", "httpCode", or "statusCode". This is, in fact, the classic [ETL
+(extract, transform and load)](https://en.wikipedia.org/wiki/Extract,_transform,_load)
+design challenge.
+
+In this case, having a known, defined datasource schema becomes essential. The job
+of each of many ingestion specs or MSQ queries is to transform the incoming data from
+the format in which it appears into the form we've designed for our datasource.
+While table metadata is generally useful, it is this many-to-one mapping scenario
+where it finds its greatest need.
+
+### Table Metadata is Optional
+
+Table metadata is a new feature. Existing Druid systems have physical datasource
+metadata, but no logical metadata. As a result, table metadata can be thought of as a set
+of "hints": optional information that, if present, provides additional features not
+available when only using the existing physical metadata. See the
+[datasource section](./datasource.md) for information on the use of metadata with
+datasources. Without external metadata, you can use the existing `extern()` table
+function to specify an MSQ input source, format and schema.
+
+### Table Metadata is Extensible
+
+This documentation explains the Druid-defined properties available for datasources
+and external tables. The metadata system is extensible: you can define other
+properties, as long as those properties use names distinct from the Druid-defined
+names. For example, you might want to track the source of the data, or might want
+to include display name or display format properties to help your UI display
+columns.
+
+## Overview of Table Metadata
+
+Here is a typical scenario to explain how table metadata works. Let's say we want
+to create a new datasource. (Although you can use table metadata with an existing
+datasource, this explanation will focus on a new one.) We know our input data is
+not quite what dashboard users expect, so we want to clean things up a bit.
+
+### Design the Datasource
+
+We start with pencil and paper (actually, a text editor or spreadsheet) and we
+list the input columns and their types. Next, we think about the name we want to
+appear for that column in our UI, and the data type we'd prefer. If we are going
+to use Druid's rollup feature, we also want to identify the aggregation we wish
+to use. The result might look like this:
+
+```text
+Input source: web-metrics.csv
+Datasource: webMetrics
+
+    Input Column               Datasource Column
+Name          Type        Name       Type      Aggregation
+----          ----        ----       ----      -----------
+timestamp     ISO string  __time     TIMESTAMP
+host          string      host       VARCHAR
+referer       string
+status        int         httpStatus VARCHAR
+request-size  int         bytesIn    BIGINT    SUM
+response-size int         bytesOut   BIGINT    SUM
+...
+```
+
+Here, we've decided we don't need the referer. For our metrics use, the referer
+might not be helpful, so we just ignore it.
+
+Notice that we use arbitrary types for the CSV input. CSV doesn't really have
+types. We'll need to convert these to SQL types later. For datasource columns,
+we use SQL types, not Druid types. There is a SQL type for each Druid type and
+Druid aggregation. See [Datasource metadata](./datasource.md) for details.
+
+For aggregates, we use a set of special aggregate types described [here](./datasource.md).
+In our case, we use `SUM(BIGINT)` for the aggregate types.
+
+If we are using rollup, we may want to truncate our timestamp. Druid rolls up
+to the millisecond level by default. Perhaps we want to roll up to the second
+level (that is, combine all data for a given host and status that occurs within
+any one second of time). We do so by specifying a parameter to the time data type:
+`TIMESTAMP('PT1S')`.
+
+A Druid datasource needs two additional properties: our time partitioning
+and clustering. [Time partitioning](../multi-stage-query/reference.html#partitioned-by)
+gives the size of the time chunk stored in each segment file. Perhaps our system is of
+moderate size, and we want our segments to hold a day of data, or `P1D`.
+
+We can also choose [clustering](../multi-stage-query/reference.html#clustered-by), which
+is handy as data sizes expand. Suppose we want to cluster on `host`.
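+
+Expressed with the property names from the [API reference](./api.md), these two choices
+would appear in the datasource spec roughly as follows (a partial sketch, not a complete
+spec):
+
+```json
+"properties": {
+  "segmentGranularity": "P1D",
+  "clusterKeys": [ { "column": "host" } ]
+}
+```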
+
+We have the option of "sealing" the schema, meaning that we _only_ want these columns
+to be ingested, but no others. We'll do that here. The default option is to be
+unsealed, meaning that ingestion can add new columns at will without the need to first
+declare them in the schema. Use the option that works best for your use case.
+
+We are now ready to define our datasource metadata. We can craft a JSON payload with the
+information (handy if we write a script to do the translation work.) Or, we
+can use the Druid web console. Either way, we create a metadata entry for
+our `webMetrics` datasource that contains the columns and types above.
+
+Here's an example of the JSON payload:
+
+```JSON
+TO DO
+```
+
+If using the REST API, we create the table using the (TODO) API (NEED LINK).
+
+### Define our Input Data
+
+We plan to load new data every day, just after midnight, so our ingestion frequency
+matches our time partitioning. (As our system grows, we will probably ingest more
+frequently, and perhaps shift to streaming ingestion, but let's start simple.)
+We don't want to repeat the same information on each ingest. Instead, we want to
+store that information once and reuse it.
+
+To make things simple, let's assume that Druid runs on only one server, so we can
+load from a local directory. Let's say we always put the data in `/var/metrics/web`
+and we name our directories with the date: `2022-12-02`, etc. Each directory has any
+number of files for that day. Let's further suppose that we know that our inputs are
+always CSV, and always have the columns we listed above.
+
+We model this in Druid by defining an external table that includes everything about
+the input _except_ the set of files (which change each day). Let's call our external
+table `webInput`. All external tables reside in the `ext` schema (namespace).
+
+We again use the (TODO) API, this time with a JSON payload for the external table.
+The bulk of the information is similar to what we use in an MSQ `extern` function:
+the JSON-serialized form of an input source and format. But the input source is
+"partial": we leave off the actual files.
+
+Notice we use SQL types when defining the external table. Druid uses this information
+to parse the CSV file columns into the desired SQL (and Druid) type.
+
+```JSON
+TO DO
+```
+
+### MSQ Ingest
+
+Our next step is to load some data. We'll use MSQ. We've defined our datasource, and
+the input files from which we'll read. We use MSQ to define the mapping from the
+input to the output.
+
+```sql
+INSERT INTO webMetrics
+SELECT
+  TIME_PARSE(timestamp) AS __time,
+  host,
+  CAST(status AS VARCHAR) AS httpStatus,
+  SUM("request-size") AS bytesIn,
+  SUM("response-size") AS bytesOut
+FROM ext.webInput(filePattern => '2022-12-02/*.csv')
+GROUP BY __time, host, status
+```
+
+Some things to notice:
+
+* Druid's `INSERT` statement has a special way to match columns. Unlike standard SQL,
+which matches by position, Druid SQL matches by name. Thus, we must use `AS` clauses
+to ensure our data has the same name as the column in our datasource.
+* We do not include a `TIME_FLOOR` function because Druid inserts that automatically
+based on the `TIMESTAMP('PT1S')` data type.
+* We include a `CAST` to convert `status` (a number) to `httpStatus`
+(as string). Druid will validate the type based on the table metadata.
+* We include the aggregations for the two aggregated fields.
+The aggregations must match those defined in table metadata.
+* We also must explicitly spell out the `GROUP BY` clause which must include all
+non-aggregated columns.
+* Notice that there are no `PARTITIONED BY` or `CLUSTERED BY` clauses: Druid fills those
+in from table metadata.
+
+If we make a mistake, and omit a required column alias, or use the wrong alias, the
+MSQ query will fail. This is part of the data governance idea: Druid just prevented us
+from creating segments that don't satisfy our declared table schema.
+
+### Query and Schema Evolution
+
+We can now query our data. Though it is not obvious in this example, the query experience
+is also guided by table metadata. To see this, suppose that we soon discover that no one
+actually needs the `bytesIn` field: our server doesn't accept `POST` requests and so all
+requests are pretty small. To avoid confusing users, we want to remove that column. But,
+we've already ingested data. We could throw away the data and start over. Or, we could
+simply "hide" the unwanted column using the
+[hide columns](api.md) API.
+
+If we now do a `SELECT *` query we find that `bytesIn` is no longer available and we
+no longer have to explain to our users to ignore that column. While hiding a column is
+trivial in this case, imagine if you wanted to make that decision after loading a petabyte
+of data. Hiding a column gives you an immediate solution, even if it might take a long
+time to rewrite (compact) all the existign segments to physicaly remove the column.
+
+We also have to change our MSQ ingest statement to avoid loading any more of the
+now-unused column. This can happen at any time: MSQ ingest won't fail if we try to ingest
+into a hidden column.
+
+In a similar way, if we realize that we actually want `httpStatus` to be `BIGINT`, we
+can change the type, again using the [REST API](api.md).
+
+New ingestions will use the new type. Again, it may be costly to change existing data.
+Drill will, however, automatically `CAST` the old `VARCHAR` (string) data to the

Review Comment:
   Should it be Druid?



##########
docs/catalog/index.md:
##########
@@ -0,0 +1,314 @@
+---
+id: index
+title: Table Metadata Catalog
+sidebar_label: Introduction
+description: Introduces the table metadata features
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid 26.0 introduces the idea of _metadata_ to describe datasources and external
+tables. External table metadata simplifies MSQ ingestion statements. Datasource
+metadata gives you another data governance tool to manage and communicate the data
+stored in a datasource. The use of table metadata is entirely optional: you can
+use it for no tables, some tables or all tables as your needs evolve.
+
+The table metadata feature consists of the following main parts:
+
+* A new table in Druid's catalog (also known as the metadata store),
+* A set of new REST APIs to manage table metadata entries,
+* SQL integratoin to put the metadata to work.
+
+Table metadata is an experimental feature in this release. We encourage you to try
+it and to provide feedback. However, since the feature is experimental, some parts
+may change based on feedback.
+
+In this section, we use the term "table" to refer to both datasources and external
+tables. To SQL, these are both tables, and the table metadata feature treats them
+similarly.
+
+## Contents
+
+- [Introduction](./index.md) - Overview of the table metadata feature.
+- [Metadata API](./api.md) - Reference for the REST API operations to create and modify metadata.
+- [Datasource Metadata](./datasource.md) - Details for datasource metadata and partial tables.
+- [External Tables](./ext.md) - Details about external tables used in MSQ ingestion.
+
+
+## Enable Table Metadata Storage
+
+Table metadata storage resides in an extension which is not loaded by default. Instead,
+you must explicitly "opt in" to using this feature. Table metadata store is required
+for the other features described in this section.
+
+Enable table metadata storage by adding `druid-catalog` to your list of loaded
+extensions in `_common/common.runtime.properties`:
+
+```text
+druid.extensions.loadList=[..., "druid-catalog"]
+```
+
+Enabling this extension causes Druid to create the table metadata table and to
+make the table metadata REST APIs available for use.
+
+## When to Use Table Metadata
+
+People use Druid in many ways. When learning Druid, we often load data just to
+see how Druid works and to experiment. For this, "classic" Druid is ideal: just
+load data and go: there is no need for additional datasource metadata.
+
+As we put Druid to work, we craft an application around Druid: a data pipeline
+that feeds data into Druid, UI (dashboards, applications) which pull data out,
+and deployment scripts to keep the system running. At this point, it can be
+useful to design the table schema with as much care as the rest of the system.
+What columns are needed? What column names will be the most clear to the end
+user? What data type should each column have?
+
+Historically, developers encoded these decisions into Druid ingestion specs.
+With MSQ, that information is encoded into SQL statements. To control changes,
+the specs or SQL statements are versioned in Git, subject to reviews when changes
+occur.
+
+This approach works well when there is a single source of data for each datasource.
+It can become difficult to manage when we want to load multiple, different inputs
+into a single datasource, and we want to create a single, composite schema that
+represents all of them. Think, for example, of a metrics application in which
+data arrives from various metrics feeds such as logs and so on. We want to decide
+upon a common set of columns so that, say HTTP status is always stored as a string
+in a column called "http_status", even if some of the inputs use names such as
+"status", "httpCode", or "statusCode". This is, in fact, the classic [ETL
+(extract, transform and load)](https://en.wikipedia.org/wiki/Extract,_transform,_load)
+design challenge.
+
+In this case, having a known, defined datasource schema becomes essential. The job
+of each of many ingestion specs or MSQ queries is to transform the incoming data from
+the format in which it appears into the form we've designed for our datasource.
+While table metadata is generally useful, it is this many-to-one mapping scenario
+where it finds its greatest need.
+
+### Table Metadata is Optional
+
+Table metadata is a new feature. Existing Druid systems have physical datasource
+metadata, but no logical metadata. As a result, table metadata can be thought of as a set
+of "hints": optional information that, if present, provides additional features not
+available when only using the existing physical metadata. See the
+[datasource section](./datasource.md) for information on the use of metadata with
+datasources. Without external metadata, you can use the existing `extern()` table
+function to specify an MSQ input source, format and schema.
+
+### Table Metadata is Extensible
+
+This documentation explains the Druid-defined properties available for datasources
+and external tables. The metadata system is extensible: you can define other
+properties, as long as those properties use names distinct from the Druid-defined
+names. For example, you might want to track the source of the data, or might want
+to include display name or display format properties to help your UI display
+columns.
+
+## Overview of Table Metadata
+
+Here is a typical scenario to explain how table metadata works. Let's say we want
+to create a new datasource. (Although you can use table metadata with an existing
+datasource, this explanation will focus on a new one.) We know our input data is
+not quite what dashboard users expect, so we want to clean things up a bit.
+
+### Design the Datasource
+
+We start with pencil and paper (actually, a text editor or spreadsheet) and we
+list the input columns and their types. Next, we think about the name we want to
+appear for that column in our UI, and the data type we'd prefer. If we are going
+to use Druid's rollup feature, we also want to identify the aggregation we wish
+to use. The result might look like this:
+
+```text
+Input source: web-metrics.csv
+Datasource: webMetrics
+
+    Input Column               Datasource Column
+Name          Type        Name       Type      Aggregation
+----          ----        ----       ----      -----------
+timestamp     ISO string  __time     TIMESTAMP
+host          string      host       VARCHAR
+referer       string
+status        int         httpStatus VARCHAR
+request-size  int         bytesIn    BIGINT    SUM
+response-size int         bytesOut   BIGINT    SUM
+...
+```
+
+Here, we've decided we don't need the referer. For our metrics use, the referer
+might not be helpful, so we just ignore it.
+
+Notice that we use arbitrary types for the CSV input. CSV doesn't really have
+types. We'll need to convert these to SQL types later. For datasource columns,
+we use SQL types, not Druid types. There is a SQL type for each Druid type and
+Druid aggregation. See [Datasource metadata](./datasource.md) for details.
+
+For aggregates, we use a set of special aggregate types described [here](./datasource.md).
+In our case, we use `SUM(BIGINT)` for the aggregate types.
+
+If we are using rollup, we may want to truncate our timestamp. Druid rolls up
+to the millisecond level by default. Perhaps we want to roll up to the second
+level (that is, combine all data for a given host and status that occurs within
+any one second of time.) We do this by specifying a parameter to the time data type:
+`TIMESTAMP('PT1S')`.
+
+A Druid datasource needs two additional properties: our time partitioning
+and clustering. [Time partitioning](../multi-stage-query/reference.html#partitioned-by)
+gives the size of the time chunk stored in each segment file. Perhaps our system is of
+moderate size, and we want our segments to hold a day of data, or `P1D`.
+
+We can also choose [clustering](../multi-stage-query/reference.html#clustered-by), which
+is handy as data sizes expand. Suppose we want to cluster on `host`.
+
+We have the option of "sealing" the schema, meaning that we _only_ want these columns
+to be ingested, but no others. We'll do that here. The default option is to be
+unsealed, meaning that ingestion can add new columns at will without the need to first
+declare them in the schema. Use the option that works best for your use case.
+
+We are now ready to define our datasource metadata. We can craft a JSON payload with the
+information (handy if we write a script to do the translation work.) Or, we
+can use the Druid web console. Either way, we create a metadata entry for
+our `webMetrics` datasource that contains the columns and types above.
+
+Here's an example of the JSON payload:
+
+```JSON
+TO DO
+```
+
+If using the REST API, we create the table using the (TODO) API (NEED LINK).
+
+### Define our Input Data
+
+We plan to load new data every day, just after midnight, so our ingestion frequency
+matches our time partitioning. (As our system grows, we will probably ingest more
+frequently, and perhaps shift to streaming ingestion, but let's start simple.)
+We don't want to repeat the same information on each ingest. Instead, we want to
+store that information once and reuse it.
+
+To make things simple, let's assume that Druid runs on only one server, so we can
+load from a local directory. Let's say we always put the data in `/var/metrics/web`
+and we name our directories with the date: `2022-12-02`, etc. Each directory has any
+number of files for that day. Let's further suppose that we know that our inputs are
+always CSV, and always have the columns we listed above.
+
+We model this in Druid by defining an external table that includes everything about
+the input _except_ the set of files (which change each day). Let's call our external
+table `webInput`. All external tables reside in the `ext` schema (namespace).
+
+We again use the (TODO) API, this time with a JSON payload for the external table.
+The bulk of the information is similar to what we use in an MSQ `extern` function:
+the JSON serialized form of an input source and format. But, the input source is
+"partial", we leave off the actual files.
+
+Notice we use SQL types when defining the external table. Druid uses this information
+to parse the CSV file columns into the desired SQL (and Druid) type.
+
+```JSON
+TO DO
+```
+
+### MSQ Ingest
+
+Our next step is to load some data. We'll use MSQ. We've defined our datasource, and
+the input files from which we'll read. We use MSQ to define the mapping from the
+input to the output.
+
+```sql
+INSERT INTO webMetrics
+SELECT
+  TIME_PARSE(timestamp) AS __time,
+  host,
+  CAST(status AS VARCHAR) AS httpStatus,
+  SUM("request-size") AS bytesIn,
+  SUM("response-size") AS bytesOut
+FROM ext.webInput(filePattern => '2022-12-02/*.csv')
+GROUP BY __time, host, status
+```
+
+Some things to notice:
+
+* Druid's `INSERT` statement has a special way to match columns. Unlike standard SQL,
+which matches by position, Druid SQL matches by name. Thus, we must use `AS` clauses
+to ensure our data has the same name as the column in our datasource.
+* We do not include a `TIME_FLOOR` function because Druid inserts that automatically
+based on the `TIMESTAMP('PT1S')` data type.
+* We include a `CAST` to convert `status` (a number) to `httpStatus`
+(as string). Druid will validate the type based on the table metadata.
+* We include the aggregations for the two aggregated fields.
+The aggregations must match those defined in table metadata.
+* We also must explicitly spell out the `GROUP BY` clause which must include all
+non-aggregated columns.
+* Notice that there are no `PARTITIONED BY` or `CLUSTERED BY` clauses: Druid fills those
+in from table metadata.
+
+If we make a mistake, and omit a required column alias, or use the wrong alias, the
+MSQ query will fail. This is part of the data governance idea: Druid just prevented us
+from creating segments that don't satisfy our declared table schema.
+
+### Query and Schema Evolution
+
+We can now query our data. Though it is not obvious in this example, the query experience
+is also guided by table metadata. To see this, suppose that we soon discover that no one
+actually needs the `bytesIn` field: our server doesn't accept `POST` requets and so all
+request are pretty small. To avoid confusing users, we want to remove that column. But,
+we've already ingested data. We could throw away the data and start over. Or, we could
+simply "hide" the unwanted column using the
+[hide columns](api.md) API.
+
+If we now do a `SELECT *` query we find that `bytesIn` is no longer available and we
+no longer have to explain to our users to ignore that column. While hiding a column is
+trivial in this case, imagine if you wanted to make that decision after loading a petabyte
+of data. Hiding a column gives you an immediate solution, even if it might take a long
+time to rewrite (compact) all the existing segments to physically remove the column.
+
+We also have to change our MSQ ingest statement to avoid loading any more of the
+now-unused column. This can happen at any time: MSQ ingest won't fail if we try to ingest
+into a hidden column.
+
+In a similar way, if we realize that we actually want `httpStatus` to be `BIGINT`, we
+can change the type, again using the [REST API](api.md).
+
+New ingestions will use the new type. Again, it may be costly to change existing data.
+Drill will, however, automatically `CAST` the old `VARCHAR` (string) data to the
+new `BIGINT` type, allowing us to make the change immediately.
+
+## Details
+
+Now that we understand what table metadata is, and how we might use it, we are ready
+to dive into the technical reference material which follows.
+
+## Limitations
+
+This is the first version of the catalog feature. A number of limitations are known:
+
+* When using the catalog to specify column types for a table ingested via MSQ, the
+  resulting types are determined by the `SELECT` list, not the table schema. Druid will
+  ensure that the actual type is compatible with the declared type, but the actual type
+  may still differ from the declared type. Use a `CAST` to ensure the types are as
+  desiried.

Review Comment:
   ```suggestion
     desired.
   ```



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java:
##########
@@ -1868,52 +1872,91 @@ private static Pair<List<DimensionSchema>, List<AggregatorFactory>> makeDimensio
 
     // Each column can be of either time, dimension, aggregator. For this method. we can ignore the time column.
     // For non-complex columns, If the aggregator factory of the column is not available, we treat the column as
-    // a dimension. For complex columns, certains hacks are in place.
+    // a dimension. For complex columns, certain hacks are in place.
     for (final String outputColumn : outputColumnsInOrder) {
-      final String queryColumn = columnMappings.getQueryColumnForOutputColumn(outputColumn);
-      final ColumnType type =
-          querySignature.getColumnType(queryColumn)
-                        .orElseThrow(() -> new ISE("No type for column [%s]", outputColumn));
-
-      if (!outputColumn.equals(ColumnHolder.TIME_COLUMN_NAME)) {
-
-        if (!type.is(ValueType.COMPLEX)) {
-          // non complex columns
-          populateDimensionsAndAggregators(
-              dimensions,
-              aggregators,
-              outputColumnAggregatorFactories,
-              outputColumn,
-              type
-          );
-        } else {
-          // complex columns only
-          if (DimensionHandlerUtils.DIMENSION_HANDLER_PROVIDERS.containsKey(type.getComplexTypeName())) {
-            dimensions.add(DimensionSchemaUtils.createDimensionSchema(outputColumn, type));
-          } else if (!isRollupQuery) {
-            aggregators.add(new PassthroughAggregatorFactory(outputColumn, type.getComplexTypeName()));
-          } else {
-            populateDimensionsAndAggregators(
-                dimensions,
-                aggregators,
-                outputColumnAggregatorFactories,
-                outputColumn,
-                type
-            );
-          }
-        }
+      final ColumnType type = computeStorageType(storageTypes, querySignature, columnMappings, outputColumn);
+
+      if (outputColumn.equals(ColumnHolder.TIME_COLUMN_NAME)) {
+        // Skip
+      } else if (!type.is(ValueType.COMPLEX)) {
+        // non complex columns
+        populateDimensionsAndAggregators(
+            dimensions,
+            aggregators,
+            outputColumnAggregatorFactories,
+            outputColumn,
+            type
+        );
+     // complex columns only
+      } else if (DimensionHandlerUtils.DIMENSION_HANDLER_PROVIDERS.containsKey(type.getComplexTypeName())) {
+        dimensions.add(DimensionSchemaUtils.createDimensionSchema(outputColumn, type));
+      } else if (!isRollupQuery) {
+        aggregators.add(new PassthroughAggregatorFactory(outputColumn, type.getComplexTypeName()));
+      } else {
+        populateDimensionsAndAggregators(
+            dimensions,
+            aggregators,
+            outputColumnAggregatorFactories,
+            outputColumn,
+            type
+        );
       }
     }
 
     return Pair.of(dimensions, aggregators);
   }
 
+  // TODO: This is a starter set of basic types.
+  private static final Map<String, ColumnType> COLUMN_TYPES = ImmutableMap.<String, ColumnType>builder()
+        .put(ColumnType.STRING.asTypeString(), ColumnType.STRING)
+        .put(ColumnType.LONG.asTypeString(), ColumnType.LONG)
+        .put(ColumnType.FLOAT.asTypeString(), ColumnType.FLOAT)
+        .put(ColumnType.DOUBLE.asTypeString(), ColumnType.DOUBLE)
+        .build();
+
+  /**
+   * Compute the storage (segment column) type for the given output column. The
+   * storage type is either provided explicitly, or is inferred from the query.
+   * <p>
+   * When the storage type differs from the query output type, the two must be
+   * compatible. That is, the appenderator must be able to convert from the query
+   * type to the storage type. The Calcite validator will have done this check,
+   * so that, by the time we get here, we can assume any required conversions
+   * are supported.
+   */
+  private static ColumnType computeStorageType(
+      final Map<String, String> storageTypes,
+      final RowSignature querySignature,
+      final ColumnMappings columnMappings,
+      final String outputColumn
+  )
+  {
+    // If output types are provided, and if this column has a type, and that type is valid,
+    // then use that type as the output type. The output type comes from a table schema
+    // declaration, such as the Druid catalog.
+    if (storageTypes != null) {
+      String outputType = storageTypes.get(outputColumn);
+      if (outputType != null) {
+        ColumnType type = COLUMN_TYPES.get(StringUtils.toUpperCase(outputType));
+        if (type != null) {
+          return type;
+        }

Review Comment:
   In case `type==null`, should an exception be thrown instead of defaulting to inferring the type from the signature? 
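   
   If the answer is yes, something like the following sketch (reusing the names already in this diff; the exact error message is just an assumption) would surface the problem instead of silently falling back to the inferred type:
   ```java
   // Hypothetical variant: fail fast when the declared type is not in the supported set,
   // rather than quietly inferring the storage type from the query signature.
   ColumnType type = COLUMN_TYPES.get(StringUtils.toUpperCase(outputType));
   if (type == null) {
     throw new ISE("Unsupported declared storage type [%s] for column [%s]", outputType, outputColumn);
   }
   return type;
   ```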



##########
docs/catalog/index.md:
##########
@@ -0,0 +1,314 @@
+---
+id: index
+title: Table Metadata Catalog
+sidebar_label: Introduction
+description: Introduces the table metadata features
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid 26.0 introduces the idea of _metadata_ to describe datasources and external
+tables. External table metadata simplifies MSQ ingestion statements. Datasource
+metadata gives you another data governance tool to manage and communicate the data
+stored in a datasource. The use of table metadata is entirely optional: you can
+use it for no tables, some tables, or all tables as your needs evolve.
+
+The table metadata feature consists of the following main parts:
+
+* A new table in Druid's catalog (also known as the metadata store),
+* A set of new REST APIs to manage table metadata entries,
+* SQL integration to put the metadata to work.
+
+Table metadata is an experimental feature in this release. We encourage you to try
+it and to provide feedback. However, since the feature is experimental, some parts
+may change based on feedback.
+
+In this section, we use the term "table" to refer to both datasources and external
+tables. To SQL, these are both tables, and the table metadata feature treats them
+similarly.
+
+## Contents
+
+- [Introduction](./index.md) - Overview of the table metadata feature.
+- [Metadata API](./api.md) - Reference for the REST API operations to create and modify metadata.
+- [Datasource Metadata](./datasource.md) - Details for datasource metadata and partial tables.
+- [External Tables](./ext.md) - Details about external tables used in MSQ ingestion.
+
+
+## Enable Table Metadata Storage
+
+Table metadata storage resides in an extension which is not loaded by default. Instead,
+you must explicitly "opt in" to using this feature. Table metadata store is required
+for the other features described in this section.
+
+Enable table metadata storage by adding `druid-catalog` to your list of loaded
+extensions in `_common/common.runtime.properties`:
+
+```text
+druid.extensions.loadList=[..., "druid-catalog"]
+```
+
+Enabling this extension causes Druid to create the table metadata table and to
+make the table metadata REST APIs available for use.
+
+## When to Use Table Metadata
+
+People use Druid in many ways. When learning Druid, we often load data just to
+see how Druid works and to experiment. For this, "classic" Druid is ideal: just
+load data and go: there is no need for additional datasource metadata.
+
+As we put Druid to work, we craft an application around Druid: a data pipeline
+that feeds data into Druid, UI (dashboards, applications) which pull data out,
+and deployment scripts to keep the system running. At this point, it can be
+useful to design the table schema with as much care as the rest of the system.
+What columns are needed? What column names will be the most clear to the end
+user? What data type should each column have?
+
+Historically, developers encoded these decisions into Druid ingestion specs.
+With MSQ, that information is encoded into SQL statements. To control changes,
+the specs or SQL statements are versioned in Git, subject to reviews when changes
+occur.
+
+This approach works well when there is a single source of data for each datasource.
+It can become difficult to manage when we want to load multiple, different inputs
+into a single datasource, and we want to create a single, composite schema that
+represents all of them. Think, for example, of a metrics application in which
+data arrives from various metrics feeds such as logs and so on. We want to decide
+upon a common set of columns so that, say HTTP status is always stored as a string
+in a column called "http_status", even if some of the inputs use names such as
+"status", "httpCode", or "statusCode". This is, in fact, the classic [ETL
+(extract, transform and load)](https://en.wikipedia.org/wiki/Extract,_transform,_load)
+design challenge.
+
+In this case, having a known, defined datasource schema becomes essential. The job
+of each of many ingestion specs or MSQ queries is to transform the incoming data from
+the format in which it appears into the form we've designed for our datasource.
+While table metadata is generally useful, it is this many-to-one mapping scenario
+where it finds its greatest need.
+
+### Table Metadata is Optional
+
+Table metadata is a new feature. Existing Druid systems have physical datasource
+metadata, but no logical metadata. As a result, table metadata can be thought of as a set
+of "hints": optional information that, if present, provides additional features not
+available when only using the existing physical metadata. See the
+[datasource section](./datasource.md) for information on the use of metadata with
+datasources. Without external metadata, you can use the existing `extern()` table
+function to specify an MSQ input source, format and schema.
+
+### Table Metadata is Extensible
+
+This documentation explains the Druid-defined properties available for datasources
+and external tables. The metadata system is extensible: you can define other
+properties, as long as those properties use names distinct from the Druid-defined
+names. For example, you might want to track the source of the data, or might want
+to include display name or display format properties to help your UI display
+columns.
+
+## Overview of Table Metadata
+
+Here is a typical scenario to explain how table metadata works. Let's say we want
+to create a new datasource. (Although you can use table metadata with an existing
+datasource, this explanation will focus on a new one.) We know our input data is
+not quite what dashboard users expect, so we want to clean things up a bit.
+
+### Design the Datasource
+
+We start with pencil and paper (actually, a text editor or spreadsheet) and we
+list the input columns and their types. Next, we think about the name we want to
+appear for that column in our UI, and the data type we'd prefer. If we are going
+to use Druid's rollup feature, we also want to identify the aggregation we wish
+to use. The result might look like this:
+
+```text
+Input source: web-metrics.csv
+Datasource: webMetrics
+
+    Input Column               Datasource Column
+Name          Type        Name       Type      Aggregation
+----          ----        ----       ----      -----------
+timestamp     ISO string  __time     TIMESTAMP
+host          string      host       VARCHAR
+referer       string
+status        int         httpStatus VARCHAR
+request-size  int         bytesIn    BIGINT    SUM
+response-size int         bytesOut   BIGINT    SUM
+...
+```
+
+Here, we've decided we don't need the referer. For our metrics use, the referer
+might not be helpful, so we just ignore it.
+
+Notice that we use arbitrary types for the CSV input. CSV doesn't really have
+types. We'll need to convert these to SQL types later. For datasource columns,
+we use SQL types, not Druid types. There is a SQL type for each Druid type and
+Druid aggregation. See [Datasource metadata](./datasource.md) for details.
+
+For aggregates, we use a set of special aggregate types described [here](./datasource.md).
+In our case, we use `SUM(BIGINT)` for the aggregate types.
+
+If we are using rollup, we may want to truncate our timestamp. Druid rolls up
+to the millisecond level by default. Perhaps we want to roll up to the second
+level (that is, combine all data for a given host and status that occurs within
+any one second of time.) We do this by specifying a parameter to the time data type:
+`TIMESTAMP('PT1S')`.
+
+A Druid datasource needs two additional properties: our time partitioning
+and clustering. [Time partitioning](../multi-stage-query/reference.html#partitioned-by)
+gives the size of the time chunk stored in each segment file. Perhaps our system is of
+moderate size, and we want our segments to hold a day of data, or `P1D`.
+
+We can also choose [clustering](../multi-stage-query/reference.html#clustered-by), which
+is handy as data sizes expand. Suppose we want to cluster on `host`.
+
+We have the option of "sealing" the schema, meaning that we _only_ want these columns
+to be ingested, but no others. We'll do that here. The default option is to be
+unsealed, meaning that ingestion can add new columns at will without the need to first
+declare them in the schema. Use the option that works best for your use case.
+
+We are now ready to define our datasource metadata. We can craft a JSON payload with the
+information (handy if we write a script to do the translation work.) Or, we
+can use the Druid web console. Either way, we create a metadata entry for
+our `webMetrics` datasource that contains the columns and types above.
+
+Here's an example of the JSON payload:
+
+```JSON
+TO DO
+```
+
+If using the REST API, we create the table using the (TODO) API (NEED LINK).
+
+### Define our Input Data
+
+We plan to load new data every day, just after midnight, so our ingestion frequency
+matches our time partitioning. (As our system grows, we will probably ingest more
+frequently, and perhaps shift to streaming ingestion, but let's start simple.)
+We don't want to repeat the same information on each ingest. Instead, we want to
+store that information once and reuse it.
+
+To make things simple, let's assume that Druid runs on only one server, so we can
+load from a local directory. Let's say we always put the data in `/var/metrics/web`
+and we name our directories with the date: `2022-12-02`, etc. Each directory has any
+number of files for that day. Let's further suppose that we know that our inputs are
+always CSV, and always have the columns we listed above.
+
+We model this in Druid by defining an external table that includes everything about
+the input _except_ the set of files (which change each day). Let's call our external
+table `webInput`. All external tables reside in the `ext` schema (namespace).
+
+We again use the (TODO) API, this time with a JSON payload for the external table.
+The bulk of the information is similar to what we use in an MSQ `extern` function:
+the JSON serialized form of an input source and format. But, the input source is
+"partial", we leave off the actual files.
+
+Notice we use SQL types when defining the external table. Druid uses this information
+to parse the CSV file columns into the desired SQL (and Druid) type.
+
+```JSON
+TO DO
+```
+
+### MSQ Ingest
+
+Our next step is to load some data. We'll use MSQ. We've defined our datasource, and
+the input files from which we'll read. We use MSQ to define the mapping from the
+input to the output.
+
+```sql
+INSERT INTO webMetrics
+SELECT
+  TIME_PARSE(timestamp) AS __time,
+  host,
+  CAST(status AS VARCHAR) AS httpStatus,
+  SUM("request-size") AS bytesIn,
+  SUM("response-size") AS bytesOut
+FROM ext.webInput(filePattern => '2022-12-02/*.csv')
+GROUP BY __time, host, status
+```
+
+Some things to notice:
+
+* Druid's `INSERT` statement has a special way to match columns. Unlike standard SQL,
+which matches by position, Druid SQL matches by name. Thus, we must use `AS` clauses
+to ensure our data has the same name as the column in our datasource.
+* We do not include a `TIME_FLOOR` function because Druid inserts that automatically
+based on the `TIMESTAMP('PT1S')` data type.
+* We include a `CAST` to convert `status` (a number) to `httpStatus`
+(as string). Druid will validate the type based on the table metadata.
+* We include the aggregations for the two aggregated fields.
+The aggregations must match those defined in table metadata.
+* We also must explicitly spell out the `GROUP BY` clause which must include all
+non-aggregated columns.
+* Notice that there are no `PARTITIONED BY` or `CLUSTERED BY` clauses: Druid fills those
+in from table metadata.
+
+If we make a mistake, and omit a required column alias, or use the wrong alias, the
+MSQ query will fail. This is part of the data governance idea: Druid just prevented us
+from creating segments that don't satisfy our declared table schema.
+
+### Query and Schema Evolution
+
+We can now query our data. Though it is not obvious in this example, the query experience
+is also guided by table metadata. To see this, suppose that we soon discover that no one
+actually needs the `bytesIn` field: our server doesn't accept `POST` requets and so all
+request are pretty small. To avoid confusing users, we want to remove that column. But,
+we've already ingested data. We could throw away the data and start over. Or, we could
+simply "hide" the unwanted column using the
+[hide columns](api.md) API.
+
+If we now do a `SELECT *` query we find that `bytesIn` is no longer available and we
+no longer have to explain to our users to ignore that column. While hiding a column is
+trivial in this case, imagine if you wanted to make that decision after loading a petabyte
+of data. Hiding a column gives you an immediate solution, even if it might take a long
+time to rewrite (compact) all the existing segments to physically remove the column.
+
+We also have to change our MSQ ingest statement to avoid loading any more of the
+now-unused column. This can happen at any time: MSQ ingest won't fail if we try to ingest
+into a hidden column.
+
+In a similar way, if we realize that we actually want `httpStatus` to be `BIGINT`, we
+can change the type, again using the [REST API](api.md).
+
+New ingestions will use the new type. Again, it may be costly to change existing data.
+Drill will, however, automatically `CAST` the old `VARCHAR` (string) data to the
+new `BIGINT` type, allowing us to make the change immediately.
+
+## Details
+
+Now that we understand what table metadata is, and how we might use it, we are ready
+to dive into the technical reference material which follows.
+
+## Limitations
+
+This is the first version of the catalog feature. A number of limitations are known:
+
+* When using the catalog to specify column types for a table ingested via MSQ, the
+  resulting types are determined by the `SELECT` list, not the table schema. Druid will
+  ensure that the actual type is compatible with the declared type, but the actual type
+  may still differ from the declared type. Use a `CAST` to ensure the types are as
+  desiried.
+* The present version supports only the S3 cloud data source.
+* Catalog information is used only for MSQ ingestion, but not for classic batch or

Review Comment:
   Should this limitation be mentioned in one of the intro paragraphs as well? One of the main doubts I had while reading the doc was how the features described would apply to native ingestion. I realized that the catalog aids only MSQ, so my doubt is moot. 



##########
docs/catalog/ext.md:
##########
@@ -0,0 +1,491 @@
+---
+id: ext
+title: External Tables
+sidebar_label: External Tables
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## External Tables
+
+Druid ingests data from external sources. Druid has three main ways to perform
+ingestion:
+
+* [MSQ](../multi-stage-query/reference.md#extern) (SQL-based ingestion)
+* "Classic" [batch ingestion](../ingestion/ingestion-spec.md#ioconfig)
+* Streaming ingestion
+
+In each case, Druid reads from an external source defined by a triple of
+input source, input format and (with MSQ) input signature. MSQ uses SQL
+to define ingestion, and SQL works with tables. Thus, with MSQ, we call
+the external source an _external table_. Druid external tables are much
+like those of similar systems (such as Hive), except that Druid, as a
+database, never reads the same data twice. This leads to the idea that
+Druid external tables are most often _parameterized_: we provide at least
+the names of the files (or objects) to read on each ingestion.
+
+Druid provides three levels of parameterization:
+
+* An ad-hoc table, such as those defined by the `extern` and similar
+  table functions, in which you specify all attributes that define the table.
+* A _partial table_, defined in the Druid catalog, that specifies some of the
+  table information (perhaps the S3 configuration and perhaps the file format
+  and column layout), leaving the rest to be provided via a named table function
+  at ingest time.
+* A fully-defined table in which all information is provided in the catalog, and
+  the table acts like a regular SQL table at run time. (Such tables are generally
+  useful in Druid for testing or when using MSQ to run queries. As noted above,
+  this format is less useful for ingestion.)
+
+See [MSQ](../multi-stage-query/reference.md) for how to define an ad-hoc table.
+Here we focus on the use
+of table metadata in the Druid catalog. An external table is just another kind
+of table in Druid's catalog: you use the same REST API methods to define and
+retrieve information about external tables as you would to wor with datasource
+tables.
+
+External tables do, however, have a unique set of properties. External table
+specifications reside in the `ext` schema within Druid.
+
+## External Table Specification
+
+An external table has the type `external`, and has two key properties:
+`source` and `format`.
+
+### `description` (String)
+
+The description property holds human-readable text to describe the external table.
+Druid does not yet use this property, though the intent is that the Druid Console
+will display this text in some useful way.
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "description" : "Logs from Apache web servers",
+    ...
+  },
+}
+```
+
+### `source` (InputSource object)
+
+The `source` property is the JSON serialized form of a Druid `InputSource`
+object as described on the [data formats page](../ingestion/data-formats.md).
+The JSON here is identical to that
+you would use as the first argument to an MSQ `extern` function. As explained
+below, you will often leave off some of the properties in order to define
+a partial table for ingestion. See the sections on each input source for examples.
+
+### `format` (InputFormat object)
+
+The `format` property is the JSON serialized form of a Druid `InputFormat`
+object as described here (LINK NEEDED).  The JSON here is identical to that
+you would use as the first argument to an MSQ `extern` function. The format
+is optional. If provided in the table specification, then it need not be
+repeated at ingest time. Use this choice when you have a set of
+identically-formatted files you read on each ingest, such as reading today's
+batch of web server log files in which the format itself stays constant.
+
+Example:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "format" : {
+      "type" : "csv",
+      "listDelimiter" : ";"
+    },
+    ...
+}
+```
+
+You can also omit the format, in which case you must provide it at ingest time.
+Use this option if you want to define a _connection_: say the properties of
+an S3 bucket, but you store multiple file types and formats in that bucket,
+and so have to provide the specific format at ingest time.
+
+When you do provide a format, it must be complete: unlike the source,
+you either provide an entire `InputFormat` spec or none at all.
+
+### External Columns
+
+The set of external table columns can be provided in the table specification
+in the catalog entry, or at run time. To provide columns in the table
+specification, provide both a name and a SQL type:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    ...
+  },
+  "columns" : [ {
+    "name" : "timetamp",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "sendBytes",
+    "sqlType" : "BIGINT"
+  } ]
+}
+```
+
+External tables cannot use Druid aggregate types.
+
+### Using a Complete Table Specification
+
+Suppose we wish to define an external table for the Wikipedia data, which
+is available on our local system at `$DRUID_HOME/quickstart/tutorial`. Let us assume
+Druid is installed at `/Users/bob/druid`. The data
+is compressed JSON. The complete table specification looks like this:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "format" : {
+      "type" : "json",
+      "keepNullColumns" : true,
+      "assumeNewlineDelimited" : true,
+      "useJsonNodeReader" : false
+    },
+    "description" : "Sample Wikipedia data",
+    "source" : {
+      "type" : "local",
+      "baseDir" : "/Users/bob/druid/quickstart/tutorial",
+      "filter" : "wikiticker-*-sampled.json"
+    }
+  },
+  "columns" : [ {
+    "name" : "timetamp",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "page",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "language",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "unpatrolled",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "newPage",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "robot",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "namespace",
+    "sqlType" : "BIGINT"
+  }, ... {
+    "name" : "added",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "deleted",
+    "sqlType" : "BIGINT"
+  }, {
+    "name" : "delta",
+    "sqlType" : "BIGINT"
+  } ]
+}
+```
+
+Let us call the table `wikipedia`. As noted above, the table resides in
+the `ext` schema.
+
+Then, we can ingest the data into a datasource also named `wikipedia` using MSQ as follows:
+
+```sql
+INSERT INTO wikipedia
+SELECT *
+FROM ext.wikipedia
+```
+
+A real ingestion would do some data transforms, such as converting the "TRUE"/"FALSE" fields
+into 1/0 values, parsing the date-time field, and so on. Here we just focus on the table
+metadata aspect.
+
+### Using a Partial Table Specification
+
+As noted, we can omit some of the information from the catalog entry to
+produce a _partial_ table specification. Suppose we have an HTTP input source
+that provides a set of files: the file contents differ each day, but all have
+the same format and schema. We define our `koala` catalog table as follows:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "uriTemplate" : "https://example.com/{}",
+    "format" : {
+      "type" : "csv"
+    },
+    "description" : "Example parameterized external table",
+    "source" : {
+      "type" : "http",
+      "httpAuthenticationUsername" : "bob",
+      "httpAuthenticationPassword" : {
+        "type" : "default",
+        "password" : "secret"
+      }
+    }
+  },
+  "columns" : [ {
+    "name" : "timetamp",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "metric",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "value",
+    "sqlType" : "BIGINT"
+  } ]
+}
+```
+
+We ingest the table by using the table name as if it were a function. We provide
+the missing information:
+
+```sql
+INSERT INTO koalaMetrics
+SELECT *
+FROM TABLE(ext.koalas(uris => 'Dec05.json, Dec06.json'))
+```
+
+On the other hand, suppose we wanted to define something more like a
+connection: we provide the location of a web server, and credentials, but
+the web server hosts a variety of files. Suppose we have a `dataHub`
+web site that hosts such files.
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "description" : "Example connection",
+    "source" : {
+      "type" : "http",
+      "uris" : [ "https://example.com/" ],
+      "httpAuthenticationUsername" : "bob",
+      "httpAuthenticationPassword" : {
+        "type" : "default",
+        "password" : "secret"
+      }
+    }
+  },
+  "columns" : [ ]
+}
+```
+
+We provide the file names plus the format and schema in the ingest query:
+
+```sql
+SELECT *
+FROM TABLE(ext.dataHub(
+  uris => 'DecLogs.csv',
+  format => 'csv'))
+  (timestamp VARCHAR, host VARCHAR, requests BIGINT)
+```
+
+## Input Sources
+
+An external table can use any of Druid's input sources, including any that extensions
+provide. Druid's own input sources have a few extra features to allow parameterized
+tables; extensions may provide such functions or not, depending on the extension.
+
+### Inline Input Source
+
+The [inline input source](../ingestion/native-batch-input-sources.md#inline-input-source)
+ is normally used internally by the Druid SQL planner. It is
+primarily useful with ingestion as a simple way to experiment with the catalog and MSQ.
+You can use the inline input source ad-hoc by using the `inline` table function.
+
+You can also define an inline input source in the Druid catalog.
+The inline input source cannot be parameterized: you must provide the complete input
+specificataion, format specfication and schema as part of the catlog definition.

Review Comment:
   ```suggestion
   specification, format specification, and schema as part of the catalog definition.
   ```



##########
docs/catalog/index.md:
##########
@@ -0,0 +1,314 @@
+---
+id: index
+title: Table Metadata Catalog
+sidebar_label: Introduction
+description: Introduces the table metadata features
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+Druid 26.0 introduces the idea of _metadata_ to describe datasources and external
+tables. External table metadata simplifies MSQ ingestion statements. Datasource
+metadata gives you another data governance tool to manage and communicate the data
+stored in a datasource. The use of table metadata is entirely optional: you can
+use it for no tables, some tables, or all tables as your needs evolve.
+
+The table metadata feature consists of the following main parts:
+
+* A new table in Druid's catalog (also known as the metadata store),
+* A set of new REST APIs to manage table metadata entries,
+* SQL integration to put the metadata to work.
+
+Table metadata is an experimental feature in this release. We encourage you to try
+it and to provide feedback. However, since the feature is experimental, some parts
+may change based on feedback.
+
+In this section, we use the term "table" to refer to both datasources and external
+tables. To SQL, these are both tables, and the table metadata feature treats them
+similarly.
+
+## Contents
+
+- [Introduction](./index.md) - Overview of the table metadata feature.
+- [Metadata API](./api.md) - Reference for the REST API operations to create and modify metadata.
+- [Datasource Metadata](./datasource.md) - Details for datasource metadata and partial tables.
+- [External Tables](./ext.md) - Details about external tables used in MSQ ingestion.
+
+
+## Enable Table Metadata Storage
+
+Table metadata storage resides in an extension which is not loaded by default. Instead,
+you must explicitly "opt in" to using this feature. Table metadata store is required
+for the other features described in this section.
+
+Enable table metadata storage by adding `druid-catalog` to your list of loaded
+extensions in `_common/common.runtime.properties`:
+
+```text
+druid.extensions.loadList=[..., "druid-catalog"]
+```
+
+Enabling this extension causes Druid to create the table metadata table and to
+make the table metadata REST APIs available for use.
+
+## When to Use Table Metadata
+
+People use Druid in many ways. When learning Druid, we often load data just to
+see how Druid works and to experiment. For this, "classic" Druid is ideal: just
+load data and go: there is no need for additional datasource metadata.
+
+As we put Druid to work, we craft an application around Druid: a data pipeline
+that feeds data into Druid, UI (dashboards, applications) which pull data out,
+and deployment scripts to keep the system running. At this point, it can be
+useful to design the table schema with as much care as the rest of the system.
+What columns are needed? What column names will be the most clear to the end
+user? What data type should each column have?
+
+Historically, developers encoded these decisions into Druid ingestion specs.
+With MSQ, that information is encoded into SQL statements. To control changes,
+the specs or SQL statements are versioned in Git, subject to reviews when changes
+occur.
+
+This approach works well when there is a single source of data for each datasource.
+It can become difficult to manage when we want to load multiple, different inputs
+into a single datasource, and we want to create a single, composite schema that
+represents all of them. Think, for example, of a metrics application in which
+data arrives from various metrics feeds such as logs and so on. We want to decide
+upon a common set of columns so that, say HTTP status is always stored as a string
+in a column called "http_status", even if some of the inputs use names such as
+"status", "httpCode", or "statusCode". This is, in fact, the classic [ETL
+(extract, transform and load)](https://en.wikipedia.org/wiki/Extract,_transform,_load)
+design challenge.
+
+In this case, having a known, defined datasource schema becomes essential. The job
+of each of many ingestion specs or MSQ queries is to transform the incoming data from
+the format in which it appears into the form we've designed for our datasource.
+While table metadata is generally useful, it is this many-to-one mapping scenario
+where it finds its greatest need.
+
+### Table Metadata is Optional
+
+Table metadata is a new feature. Existing Druid systems have physical datasource
+metadata, but no logical metadata. As a result, table metadata can be thought of as a set
+of "hints": optional information that, if present, provides additional features not
+available when only using the existing physical metadata. See the
+[datasource section](./datasource.md) for information on the use of metadata with
+datasources. Without external metadata, you can use the existing `extern()` table
+function to specify an MSQ input source, format and schema.
+
+### Table Metadata is Extensible
+
+This documentation explains the Druid-defined properties available for datasources
+and external tables. The metadata system is extensible: you can define other
+properties, as long as those properties use names distinct from the Druid-defined
+names. For example, you might want to track the source of the data, or might want
+to include display name or display format properties to help your UI display
+columns.
+
+## Overview of Table Metadata
+
+Here is a typical scenario to explain how table metadata works. Let's say we want
+to create a new datasource. (Although you can use table metadata with an existing
+datasource, this explanation will focus on a new one.) We know our input data is
+not quite what dashboard users expect, so we want to clean things up a bit.
+
+### Design the Datasource
+
+We start with pencil and paper (actually, a text editor or spreadsheet) and we
+list the input columns and their types. Next, we think about the name we want to
+appear for that column in our UI, and the data type we'd prefer. If we are going
+to use Druid's rollup feature, we also want to identify the aggregation we wish
+to use. The result might look like this:
+
+```text
+Input source: web-metrics.csv
+Datasource: webMetrics
+
+    Input Column               Datasource Column
+Name          Type        Name       Type      Aggregation
+----          ----        ----       ----      -----------
+timestamp     ISO string  __time     TIMESTAMP
+host          string      host       VARCHAR
+referer       string
+status        int         httpStatus VARCHAR
+request-size  int         bytesIn    BIGINT    SUM
+response-size int         bytesOut   BIGINT    SUM
+...
+```
+
+Here, we've decided we don't need the referer. For our metrics use, the referer
+might not be helpful, so we just ignore it.
+
+Notice that we use arbitrary types for the CSV input. CSV doesn't really have
+types. We'll need to convert these to SQL types later. For datasource columns,
+we use SQL types, not Druid types. There is a SQL type for each Druid type and
+Druid aggregation. See [Datasource metadata](./datasource.md) for details.
+
+For aggregates, we use a set of special aggregate types described [here](./datasource.md).
+In our case, we use `SUM(BIGINT)` for the aggregate types.
+
+If we are using rollup, we may want to truncate our timestamp. Druid rolls up
+to the millisecond level by default. Perhaps we want to roll up to the second
+level (that is, combine all data for a given host and status that occurs within
+any one second of time.) We do this by specifying a parameter to the time data type:
+`TIMESTAMP('PT1S')`.
+
+A Druid datasource needs two additional properties: our time partitioning
+and clustering. [Time partitioning](../multi-stage-query/reference.html#partitioned-by)
+gives the size of the time chunk stored in each segment file. Perhaps our system is of
+moderate size, and we want our segments to hold a day of data, or `P1D`.
+
+We can also choose [clustering](../multi-stage-query/reference.html#clustered-by), which
+is handy as data sizes expand. Suppose we want to cluster on `host`.
+
+We have the option of "sealing" the schema, meaning that we _only_ want these columns
+to be ingested, but no others. We'll do that here. The default option is to be
+unsealed, meaning that ingestion can add new columns at will without the need to first
+declare them in the schema. Use the option that works best for your use case.
+
+We are now ready to define our datasource metadata. We can craft a JSON payload with the
+information (handy if we write a script to do the translation work.) Or, we
+can use the Druid web console. Either way, we create a metadata entry for
+our `webMetrics` datasource that contains the columns and types above.
+
+Here's an example of the JSON payload:
+
+```JSON
+TO DO
+```
+
+If using the REST API, we create the table using the (TODO) API (NEED LINK).
+
+### Define our Input Data
+
+We plan to load new data every day, just after midnight, so our ingestion frequency
+matches our time partitioning. (As our system grows, we will probably ingest more
+frequently, and perhaps shift to streaming ingestion, but let's start simple.)
+We don't want to repeat the same information on each ingest. Instead, we want to
+store that information once and reuse it.
+
+To make things simple, let's assume that Druid runs on only one server, so we can
+load from a local directory. Let's say we always put the data in `/var/metrics/web`
+and we name our directories with the date: `2022-12-02`, etc. Each directory has any
+number of files for that day. Let's further suppose that we know that our inputs are
+always CSV, and always have the columns we listed above.
+
+We model this in Druid by defining an external table that includes everything about
+the input _except_ the set of files (which change each day). Let's call our external
+table `webInput`. All external tables reside in the `ext` schema (namespace).
+
+We again use the (TODO) API, this time with a JSON payload for the external table.
+The bulk of the information is similar to what we use in an MSQ `extern` function:
+the JSON serialized form of an input source and format. But, the input source is
+"partial", we leave off the actual files.
+
+Notice we use SQL types when defining the external table. Druid uses this information
+to parse the CSV file columns into the desired SQL (and Druid) type.
+
+```JSON
+TO DO
+```
+
+### MSQ Ingest
+
+Our next step is to load some data. We'll use MSQ. We've defined our datasource, and
+the input files from which we'll read. We use MSQ to define the mapping from the
+input to the output.
+
+```sql
+INSERT INTO webMetrics
+SELECT
+  TIME_PARSE(timestamp) AS __time,
+  host,
+  CAST(status AS VARCHAR) AS httpStatus,
+  SUM("request-size") AS bytesIn,
+  SUM("response-size") AS bytesOut
+FROM ext.webInput(filePattern => '2022-12-02/*.csv')
+GROUP BY __time, host, status
+```
+
+Some things to notice:
+
+* Druid's `INSERT` statement has a special way to match columns. Unlike standard SQL,
+which matches by position, Druid SQL matches by name. Thus, we must use `AS` clauses
+to ensure our data has the same name as the column in our datasource.
+* We do not include a `TIME_FLOOR` function because Druid inserts that automatically
+based on the `TIMESTAMP('PT1S')` data type.
+* We include a `CAST` to convert `status` (a number) to `httpStatus`
+(as string). Druid will validate the type based on the table metadata.
+* We include the aggregations for the two aggregated fields.
+The aggregations must match those defined in table metadata.
+* We also must explicitly spell out the `GROUP BY` clause which must include all
+non-aggregated columns.
+* Notice that there are no `PARTITIONED BY` or `CLUSTERED BY` clauses: Druid fills those
+in from table metadata.
+
+If we make a mistake, and omit a required column alias, or use the wrong alias, the
+MSQ query will fail. This is part of the data governance idea: Druid just prevented us
+from creating segments that don't satisfy our declared table schema.
+
+### Query and Schema Evolution
+
+We can now query our data. Though it is not obvious in this example, the query experience
+is also guided by table metadata. To see this, suppose that we soon discover that no one
+actually needs the `bytesIn` field: our server doesn't accept `POST` requets and so all

Review Comment:
   ```suggestion
   actually needs the `bytesIn` field: our server doesn't accept `POST` requests and so all
   ```



##########
docs/catalog/ext.md:
##########
@@ -0,0 +1,491 @@
+---
+id: ext
+title: External Tables
+sidebar_label: External Tables
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## External Tables
+
+Druid ingests data from external sources. Druid has three main ways to perform
+ingestion:
+
+* [MSQ](../multi-stage-query/reference.md#extern) (SQL-based ingestion)
+* "Classic" [batch ingestion](../ingestion/ingestion-spec.md#ioconfig)
+* Streaming ingestion
+
+In each case, Druid reads from an external source defined by a triple of
+input source, input format and (with MSQ) input signature. MSQ uses SQL
+to define ingestion, and SQL works with tables. Thus, with MSQ, we call
+the external source an _external table_. Druid external tables are much
+like those of similar systems (such as Hive), except that Druid, as a
+database, never reads the same data twice. This leads to the idea that
+Druid external tables are most often _parameterized_: we provide at least
+the names of the files (or objects) to read on each ingestion.
+
+Druid provides three levels of parameterization:
+
+* An ad-hoc table, such as those defined by the `extern` and similar
+  table functions, in which you specify all attributes that define the table.
+* A _partial table_, defined in the Druid catalog, that specifies some of the
+  table information (perhaps the S3 configuration and perhaps the file format
+  and column layout), leaving the rest to be provided via a named table function
+  at ingest time.
+* A fully-defined table in which all information is provided in the catalog, and
+  the table acts like a regular SQL table at run time. (Such tables are generally
+  useful in Druid for testing or when using MSQ to run queries. As noted above,
+  this format is less useful for ingestion.)
+
+See [MSQ](../multi-stage-query/reference.md) for how to define an ad-hoc table.
+Here we focus on the use
+of table metadata in the Druid catalog. An external table is just another kind
+of table in Druid's catalog: you use the same REST API methods to define and
+retrieve information about external tables as you would to wor with datasource

Review Comment:
   ```suggestion
   retrieve information about external tables as you would to work with the datasource
   ```



##########
docs/catalog/api.md:
##########
@@ -0,0 +1,261 @@
+---
+id: intro
+title: Table Metadata Catalog
+sidebar_label: Intro
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+The table metadata feature adds a number of new REST APIs to create, read, update and delete
+(CRUD) table metadata entries. Table metadata is a set of "hints" on top of physical datasources, and
+a set of convenience entries for external tables. Changing the metadata "hints" has no effect
+on existing datasources or segments, or on in-flight ingestion jobs. Metadata changes only
+affect future queries and ingestion jobs.
+
+**Note**: a future enhancement might allow a single operation to, say, delete both a metadata
+entry and the actual datasource. That functionality is not yet available. In this version, each
+metadata operation affects _only_ the information in the table metadata table within Druid's
+metadata store.
+
+A metadata table entry is called a "table specification" (or "table spec") and is represented
+by a Java class called `TableSpec`. The JSON representation of this object appears below.
+
+The API calls are designed for two use cases:
+
+* Configuration-as-code, in which the application uploads a complete table spec. Like in
+Kubernetes or other deployment tools, the source of truth for specs is in a source repository,
+typically Git. The user deploys new table specs to Druid as part of rolling out an updated
+application.
+* UI-driven, in which you use the Druid Console or similar tool to edit the table metadata
+directly.
+
+The configuration-as-code use case works with entire specs. The UI use case may work with an
+entire spec, or may perform focused "edit" operations. Here it is important to remember that
+Druid is a distributed, multi-user system: it could be that multiple users edit the same
+table at the same time. The API to overwrite a table spec provides a version number to allow
+a UI to implement optimistic locking: if the current version is not the one the UI read, then
+Druid rejects the change. A set of "edit" operations performs focused changes that avoid the

Review Comment:
   > A set of "edit" operations perform focused chagnges that avoid the need for optimistic locking: Druid applies the change rather than the UI.
   
   Can you please elaborate a bit further on this? I am unable to understand why there is a requirement for the version number in the UI case. Alternatively, what would be the downside if multiple users edit the catalog information and their edits get overwritten in a first-come-first-served manner?



##########
docs/catalog/ext.md:
##########
@@ -0,0 +1,491 @@
+---
+id: ext
+title: External Tables
+sidebar_label: External Tables
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## External Tables
+
+Druid ingests data from external sources. Druid has three main ways to perform
+ingestion:
+
+* [MSQ](../multi-stage-query/reference.md#extern) (SQL-based ingestion)
+* "Classic" [batch ingestion](../ingestion/ingestion-spec.md#ioconfig)
+* Streaming ingestion
+
+In each case, Druid reads from an external source defined by a triple of
+input source, input format and (with MSQ) input signature. MSQ uses SQL
+to define ingestion, and SQL works with tables. Thus, with MSQ, we call
+the external source an _external table_. Druid external tables are much
+like those of similar systems (such as Hive), except that Druid, as a
+database, never reads the same data twice. This leads to the idea that
+Druid external tables are most often _parameterized_: we provide at least
+the names of the files (or objects) to read on each ingestion.
+
+Druid provides three levels of parameterization:
+
+* An ad-hoc table, such as those defined by the `extern` and similar
+  table functions, in which you specify all attributes that define the table.
+* A _partial table_, defined in the Druid catalog, that specifies some of the
+  table information (perhaps the S3 configuration and perhaps the file format
+  and column layout), leaving the rest to be provided via a named table function
+  at ingest time.
+* A fully-defined table in which all information is provided in the catalog, and
+  the table acts like a regular SQL table at run time. (Such tables are generally
+  useful in Druid for testing or when using MSQ to run queries. As noted above,
+  this format is less useful for ingestion.)
+
+See [MSQ](../multi-stage-query/reference.md) for how to define an ad-hoc table.
+Here we focus on the use
+of table metadata in the Druid catalog. An external table is just another kind
+of table in Druid's catalog: you use the same REST API methods to define and
+retrieve information about external tables as you would to work with datasource
+tables.
+
+External tables do, however, have a unique set of properties. External table
+specifications reside in the `ext` schema within Druid.
+
+## External Table Specification
+
+An external table has the type `extern`, and has two key properties:
+`source` and `format`.
+
+### `description` (String)
+
+The description property holds human-readable text to describe the external table.
+Druid does not yet use this property, though the intent is that the Druid Console
+will display this text in some useful way.
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "description" : "Logs from Apache web servers",
+    ...
+  },
+}
+```
+
+### `source` (InputSource object)
+
+The `source` property is the JSON serialized form of a Druid `InputSource`
+object as described on the [data formats page](../ingestion/data-formats.md).
+The JSON here is identical to that
+you would use as the first argument to an MSQ `extern` function. As explained
+below, you will often leave off some of the properties in order to define
+a partial table for ingestion. See the sections on each input source for examples.
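+
+As a quick illustration (the per-input-source sections mentioned above give the
+authoritative examples), a partial `local` source might supply only the base
+directory, leaving the file filter to be provided by the table function at ingest
+time. Exactly which properties may be omitted is an assumption in this sketch:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "source" : {
+      "type" : "local",
+      "baseDir" : "/Users/bob/druid/quickstart/tutorial"
+    }
+  }
+}
+```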
+
+### `format` (InputFormat object)
+
+The `format` property is the JSON serialized form of a Druid `InputFormat`
+object as described here (LINK NEEDED).  The JSON here is identical to that
+you would use as the second argument to an MSQ `extern` function. The format
+is optional. If provided in the table specification, then it need not be
+repeated at ingest time. Use this choice when you have a set of
+identically-formatted files you read on each ingest, such as reading today's
+batch of web server log files in which the format itself stays constant.
+
+Example:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "format" : {
+      "type" : "csv",
+      "listDelimiter" : ";"
+    },
+    ...
+}
+```
+
+You can also omit the format, in which case you must provide it at ingest time.
+Use this option if you want to define a _connection_: for example, the properties
+of an S3 bucket in which you store multiple file types and formats, so that you
+must provide the specific format at ingest time.
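+
+As a sketch of such a connection, the spec below carries only a `source` (an S3
+input source limited to a bucket prefix), with no `format` and no columns. The
+bucket name and prefix are invented for illustration, and the exact set of S3
+properties a connection may provide is covered in the per-input-source sections
+mentioned above:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "description" : "Shared S3 bucket; file format varies by ingest job",
+    "source" : {
+      "type" : "s3",
+      "prefixes" : [ "s3://example-bucket/metrics/" ]
+    }
+  }
+}
+```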
+
+When you do provide a format, it must be complete: unlike the source,
+you either provide an entire `InputFormat` spec or none at all.
+
+### External Columns
+
+The set of external table columns can be provided in the table specification
+in the catalog entry, or at run time. To provide columns in the table
+specification, provide both a name and a SQL type:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    ...
+  },
+  "columns" : [ {
+    "name" : "timetamp",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "sendBytes",
+    "sqlType" : "BIGINT"
+  } ]
+}
+```
+
+External tables cannot use Druid aggregate types.
+
+### Using a Complete Table Specification
+
+Suppose we wish to define an external table for the Wikipedia data, which
+is available on our local system at `$DRUID_HOME/quickstart/tutorial`. Let us assume
+Druid is installed at `/Users/bob/druid`. The data
+is compressed JSON. The complete table specification looks like this:
+
+```json
+{
+  "type" : "extern",
+  "properties" : {
+    "format" : {
+      "type" : "json",
+      "keepNullColumns" : true,
+      "assumeNewlineDelimited" : true,
+      "useJsonNodeReader" : false
+    },
+    "description" : "Sample Wikipedia data",
+    "source" : {
+      "type" : "local",
+      "baseDir" : "/Users/bob/druid/quickstart/tutorial",
+      "filter" : "wikiticker-*-sampled.json"
+    }
+  },
+  "columns" : [ {
+    "name" : "timetamp",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "page",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "language",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "unpatrolled",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "newPage",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "robot",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "namespace",
+    "sqlType" : "BIGINT"
+  }, ... {
+    "name" : "added",
+    "sqlType" : "VARCHAR"
+  }, {
+    "name" : "deleted",
+    "sqlType" : "BIGINT"
+  }, {
+    "name" : "delta",
+    "sqlType" : "BIGINT"
+  } ]
+}
+```
+
+Let us call the table `wikipedia`. As noted above, the table resides in
+the `ext` schema.
+
+Then, we can ingest the data into a datasource also named `wikipedia`, using MSQ as follows:
+
+```sql
+INSERT INTO wikipedia
+SELECT *
+FROM ext.wikipedia
+```
+
+The real intestion would do some data transforms such as converting the "TRUE"/"FALSE" fields

Review Comment:
   ```suggestion
   The real ingestion would do some data transforms such as converting the "TRUE"/"FALSE" fields
   ```



##########
docs/catalog/ext.md:
##########
@@ -0,0 +1,491 @@
+---
+id: ext
+title: External Tables
+sidebar_label: External Tables
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## External Tables
+
+Druid ingests data from external sources. Druid has three main ways to perform
+ingestion:
+
+* [MSQ](../multi-stage-query/reference.md#extern) (SQL-based ingestion)
+* "Classic" [batch ingestion](../ingestion/ingestion-spec.md#ioconfig)
+* Streaming ingestion
+
+In each case, Druid reads from an external source defined by a triple of
+input source, input format and (with MSQ) input signature. MSQ uses SQL
+to define ingestion, and SQL works with tables. Thus, with MSQ, we call
+the external source an _external table_. Druid external tables are much
+like those of similar systems (such as Hive), except that Druid, as a
+database, never reads the same data twice. This leads to the idea that
+Druid external tables are most often _parameterized_: we provide at least
+the names of the files (or objects) to read on each ingestion.
+
+Druid provides three levels of parameterization:
+
+* An ad-hoc table, such as those defined by the `extern` and similar
+  table functions, in which you specify all attributes that define the table.
+* A _partial table_, defined in the Druid catalog, that specifies some of the
+  table information (perhaps the S3 configuration and perhaps the file format
+  and column layout), leaving the rest to be provided via a named table function
+  at ingest time.
+* A fully-defined table in which all information is provided in the catalog, and
+  the table acts like a regular SQL table at run time. (Such tables are generally
+  useful in Druid for testing or when using MSQ to run queries. As noted above,
+  this format is less useful for ingestion.)
+
+See [MSQ](../multi-stage-query/reference.md) for how to define an ad-hoc table.
+Here we focus on the use
+of table metadata in the Druid catalog. An external table is just another kind
+of table in Druid's catalog: you use the same REST API methods to define and
+retrieve information about external tables as you would to work with datasource
+tables.
+
+External tables do, however, have a unique set of properties. External table
+specifications reside in the `ext` schema within Druid.
+
+## External Table Specification
+
+An external table has the type `extern`, and has two key properties:
+`source` and `format`.
+
+### `description` (String)
+
+The description property hold human-readable text to describe the external table.

Review Comment:
   ```suggestion
   The description property holds human-readable text to describe the external table.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org