Posted to commits@druid.apache.org by "317brian (via GitHub)" <gi...@apache.org> on 2023/04/11 18:20:19 UTC

[GitHub] [druid] 317brian opened a new pull request, #14065: docs: add docs for schema auto-discovery

317brian opened a new pull request, #14065:
URL: https://github.com/apache/druid/pull/14065

   Updates to the ingestion spec docs and related pages for schema auto-discovery. This PR also removes most of the info related to the previous schemaless ingestion where everything was treated as a string.
   
   Build preview that people can read the rendered PR in: https://druid-git-schemaless-docs-317brian.vercel.app/docs/ingestion/ingestion-spec.html#dimensionsspec
   
   #### Release note
   n/a - covered by the code PR
   
   <hr>
   
   This PR has:
   
   - [x] been self-reviewed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190222656


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -212,7 +224,9 @@ A `dimensionsSpec` can have the following components:
 | `dimensions`           | A list of [dimension names or objects](#dimension-objects). You cannot include the same column in both `dimensions` and `dimensionExclusions`.<br /><br />If `dimensions` and `spatialDimensions` are both null or empty arrays, Druid treats all columns other than timestamp or metrics that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) for details.<br /><br />As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, it would also be good to consider [`partitioning`](partitioning.md) by those same dimensions.                                                                                                                                                                                                                                  | `[]`    |
 | `dimensionExclusions`  | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br /><br />This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details.                                                                                                                                                                                                                                                                                                                                               | `[]`    |
 | `spatialDimensions`    | An array of [spatial dimensions](../development/geo.md).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | `[]`    |
-| `includeAllDimensions` | You can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions will appear first in order that you specify them and the dimensions dynamically discovered will come after. This flag can be useful especially with auto schema discovery using [`flattenSpec`](./data-formats.html#flattenspec). If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, all discovered dimensions will be ingested. | false   |
+| `includeAllDimensions` | Note that this field only applies to schema-less ingestion where Druid ingests dimensions it discovers as strings. This is different from schema auto-discovery where Druid infers the type for data. You can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions will appear first in order that you specify them and the dimensions dynamically discovered will come after. This flag can be useful especially with auto schema discovery using [`flattenSpec`](./data-formats.html#flattenspec). If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, all discovered dimensions will be ingested. | false   |

Review Comment:
   ```suggestion
   | `includeAllDimensions` | Note that this field only applies to schema-less ingestion where Druid ingests dimensions it discovers as strings. This is different from schema auto-discovery where Druid infers the type for data. You can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions will appear first in the order that you specify them, and the dimensions dynamically discovered will come after. This flag can be useful especially with auto schema discovery using [`flattenSpec`](./data-formats.html#flattenspec). If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, all discovered dimensions will be ingested. | false   |
   ```
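   For readers following the thread, a minimal `dimensionsSpec` fragment illustrating the behavior described in that row might look like the following sketch (the column names are hypothetical):

   ```json
   "dimensionsSpec": {
     "dimensions": ["product", "department"],
     "includeAllDimensions": true
   }
   ```

   With this fragment, `product` and `department` appear first in the order given, and any additional dimensions the ingestion task discovers are appended after them as string-typed columns.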





[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1181974883


##########
docs/ingestion/schema-design.md:
##########
@@ -120,8 +120,8 @@ you must be more explicit. Druid columns have types specific upfront.
 
 Tips for modeling log data in Druid:
 
-* If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
-[automatic detection of dimension columns](#schema-less-dimensions).
+* If you don't know ahead of time what columns you'll want to ingest, set `dimensionsSpec.useSchemaDiscovery` to `true` and use an empty or partially defined dimensions list to use
+[schema auto-discovery](#schema-auto-discovery-for-dimensions).

Review Comment:
   Since there are two modes, I think this should just link to the schema discovery section without digging into how to enable it.



##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +241,14 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+You can have Druid infer the schema for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. For any dimensions that aren't a uniform type, Druid ingests them as JSON.
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use the string-based schemaless ingestion if either of the following conditions are met: 
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+- The dimension list is empty 
+- You set `includeAllDimensions` to `true` 
 

Review Comment:
   While we have these two modes, I think it is worth more clearly describing them, the differences between them particularly with regards to multi-value dimensions and array types, and call out that there currently isn't an effortless migration path between the two modes.
   
   We should probably give them semi-official names, I recommend "string-based schema discovery" and "type-aware schema discovery".
   
   For string-based schema discovery, the important parts are that all columns it discovers will be either native single or multi-value `STRING`. It will only discover primitive and arrays of primitive types, and coerce them in both cases to native Druid `STRING` types. Anything that is a nested data structure, or arrays of nested data structures, like JSON objects or whatever, will be ignored and not ingested.
   
   For type-aware schema discovery, it will discover primitive types and preserve them as the most appropriate native Druid type: `STRING`, `DOUBLE`, or `LONG`. Input formats with native boolean types will be ingested as `STRING` or `LONG` depending on the value of `druid.expressions.useStrictBooleans`. Some formats have specific coercion rules for things like enums and other special types, which typically will be coerced to a `STRING`, but this is determined by the input format, not schema discovery, and is consistent with however `flattenSpec` handles these types (I'm not sure this necessarily needs to be called out, since how these types of values are coerced depends on the underlying format).
   
   For type-aware mode, arrays of primitives in the input data will be ingested as native `ARRAY` typed columns, `ARRAY<STRING>`, `ARRAY<DOUBLE>`, or `ARRAY<LONG>`. These have `ARRAY` semantics in SQL and native queries, so are typically interacted with using the array functions (`ARRAY_` prefix) and/or `UNNEST`, in contrast with multi-value strings which can be queried using either `VARCHAR` or `ARRAY` semantics (the latter using the `MV_` functions) and have an implicit unnest, but only when grouping.
   
   Type-aware schema discovery also will ingest mixed type input, complex nested structures, and arrays of complex nested structures as `COMPLEX<json>` as a sort of catch-all. This means that unlike string-based schema discovery, type-aware schema discovery will find _all_ of the columns in the input data.
   
   The migration path we are planning between these two modes is not effortless. The biggest differences are the new mode using ARRAY types instead of MVDs, but also any mixed-type inputs might now be ingested as `COMPLEX<json>` instead of coerced to strings, which will prevent grouping (for now... this is going to change in the future). Numeric types, I think, shouldn't cause too much trouble since Druid should try best effort to coerce the values as needed to successfully complete the query, but the SQL schema will change to start validating these values as numeric, so they might need the occasional explicit cast.
   
   The current plan for the approach on migrating MVDs is that queries should be migrated to explicitly start using `UNNEST`, to ensure that no implicit multi-value behavior is being relied on in the query. We plan to introduce a config flag (or flags) that requires MVDs to be handled explicitly as ARRAY types, to ensure that all queries have been migrated to help with this, but we haven't added such a flag for this release. Once the queries are migrated, the new type-aware schema discovery mode can be enabled (after ensuring that any dimension exclusion lists are updated to include nested columns which previously would not have been automatically discovered, should they wish to continue excluding them). Alternatively, the new type-aware schema discovery mode can be enabled, but with explicitly defined 'string' dimensions for known MVDs or mixed-type inputs which should continue to be ingested as MVDs until all of the queries can be migrated to using explicit UNNEST and/or `ARRAY_` functions.
   
   It's also worth calling out that the new type-aware schema discovery mode is experimental. I imagine it is likely we will remove the string-based schema discovery mode once the type-aware mode has matured, but until then we should clearly distinguish them.
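   As a sketch of the type-aware mode described above, the spec change reduces to one flag (the excluded column name here is hypothetical):

   ```json
   "dimensionsSpec": {
     "useSchemaDiscovery": true,
     "dimensionExclusions": ["debug_payload"]
   }
   ```

   With this fragment, Druid discovers all remaining columns: primitives are preserved as `STRING`, `DOUBLE`, or `LONG`; arrays of primitives become native `ARRAY` typed columns; and mixed types and nested structures land in `COMPLEX<json>`.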





[GitHub] [druid] 317brian commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "317brian (via GitHub)" <gi...@apache.org>.
317brian commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1186533072


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +241,14 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+You can have Druid infer the schema for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. For any dimensions that aren't a uniform type, Druid ingests them as JSON.
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use the string-based schemaless ingestion if either of the following conditions are met: 
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+- The dimension list is empty 
+- You set `includeAllDimensions` to `true` 
 

Review Comment:
   Latest commit (6d5c97a) has changes based on this info.





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190234515


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- For primitive types, Druid ingests them as the most appropriate native Druid type, either string, double or long. 
+- For input formats with native boolean types will be ingested as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
+- For arrays, the data gets ingested as an array typed column: `ARRAY<STRING>`, `ARRAY<DOUBLE>`, or `ARRAY<LONG>`. You can then interact with these columns using ARRAY functions as well as UNNEST.
+- For mixed types, complex nested structures, and arrays of complex nested structures, Druid ingests the column as the least restrictive data type.
+
+If you're already using string-based schema discovery and want to migrate, see [Migrating to type-aware schema discovery](#migrating-to-type-aware-schema-discovery).
+
+#### String-based schema discovery
+
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use the string-based schema discovery for ingestion if any of the following conditions are met: 
+
+- The dimension list is empty 
+- You set `includeAllDimensions` to `true` 
+
+Druid coerces primitives and arrays of primitive types into the native Druid string type. Nested data structures and arrays of nested data structures are ignored and not ingested.
+
+#### Migrating to type-aware schema discovery
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you previously used string-based schema discovery and want to migrate to type-aware schema discovery, you'll need to do the following:
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+- Update any queries that use multi-value dimensions (MVDs) to use UNNEST in conjunction with other functions so that no  MVD behavior is being relied upon. Type-aware schema discovery generates ARRAY typed columns instead of MVDs, so queries that use any MVD features will fail.

Review Comment:
   ```suggestion
   - Update any queries that use multi-value dimensions (MVDs) to use UNNEST in conjunction with other functions so that no MVD behavior is being relied upon. Type-aware schema discovery generates ARRAY typed columns instead of MVDs, so queries that use any MVD features will fail.
   ```



##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- For primitive types, Druid ingests them as the most appropriate native Druid type, either string, double or long. 
+- For input formats with native boolean types will be ingested as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
+- For arrays, the data gets ingested as an array typed column: `ARRAY<STRING>`, `ARRAY<DOUBLE>`, or `ARRAY<LONG>`. You can then interact with these columns using ARRAY functions as well as UNNEST.
+- For mixed types, complex nested structures, and arrays of complex nested structures, Druid ingests the column as the least restrictive data type.
+
+If you're already using string-based schema discovery and want to migrate, see [Migrating to type-aware schema discovery](#migrating-to-type-aware-schema-discovery).
+
+#### String-based schema discovery
+
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use the string-based schema discovery for ingestion if any of the following conditions are met: 
+
+- The dimension list is empty 
+- You set `includeAllDimensions` to `true` 
+
+Druid coerces primitives and arrays of primitive types into the native Druid string type. Nested data structures and arrays of nested data structures are ignored and not ingested.
+
+#### Migrating to type-aware schema discovery
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you previously used string-based schema discovery and want to migrate to type-aware schema discovery, you'll need to do the following:
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+- Update any queries that use multi-value dimensions (MVDs) to use UNNEST in conjunction with other functions so that no  MVD behavior is being relied upon. Type-aware schema discovery generates ARRAY typed columns instead of MVDs, so queries that use any MVD features will fail.
+- Be aware of mixed typed inputs and test how type-aware schema discovery handles them. Druid attempts to cast them as the least restrictive type.
+- If you notice issues with numeric types, you may need to explicitly cast them. Generally though, Druid will handle the coercion for you.

Review Comment:
   ```suggestion
   - If you notice issues with numeric types, you may need to explicitly cast them. Generally, Druid handles the coercion for you.
   ```
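   The interim migration approach discussed in this thread — enabling type-aware discovery while pinning known MVDs as explicitly string-typed dimensions — could be sketched like this (the `tags` column is hypothetical):

   ```json
   "dimensionsSpec": {
     "useSchemaDiscovery": true,
     "dimensions": [
       { "type": "string", "name": "tags" }
     ]
   }
   ```

   Here `tags` continues to be ingested as a multi-value string dimension, so existing queries keep working while everything else is discovered with type-aware semantics.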





[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1182081460


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +241,14 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+You can have Druid infer the schema for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. For any dimensions that aren't a uniform type, Druid ingests them as JSON.
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use the string-based schemaless ingestion if either of the following conditions are met: 
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+- The dimension list is empty 
+- You set `includeAllDimensions` to `true` 
 

Review Comment:
   >We plan to introduce a config flag(s?) that require MVDs be handled explicitly as ARRAY types to ensure that all queries have been migrated to help with this, but haven't added such a flag for this release.
   
   btw, leave this part out of the docs until/if we actually add such a thing





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190225547


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.

Review Comment:
   ```suggestion
   - [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is an experimental feature currently available for native batch and streaming ingestion.
   ```





[GitHub] [druid] 317brian commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "317brian (via GitHub)" <gi...@apache.org>.
317brian commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1192490307


##########
website/.spelling:
##########
@@ -307,6 +307,7 @@ firefox
 firehose
 firehoses
 fromPigAvroStorage
+frontcoded

Review Comment:
   Huh, not sure where this came from. I just updated it to remove it and added the acronym MVDs.





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190217776


##########
docs/configuration/index.md:
##########
@@ -1520,7 +1520,7 @@ Additional peon configs include:
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
 |`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch-input-source.md) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use the string-based schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|

Review Comment:
   ```suggestion
   |`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use the string-based schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest placeholder data for empty columns or else not query on empty columns.<br/><br/>You can overwrite this configuration  by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
   ```
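
   A minimal sketch of the per-task override mentioned in this row, for readers following along (the surrounding ingestion spec is elided; `storeEmptyColumns` is the documented task-context key, everything else here is illustrative):

   ```json
   {
     "type": "index_parallel",
     "spec": { },
     "context": {
       "storeEmptyColumns": false
     }
   }
   ```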



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis merged PR #14065:
URL: https://github.com/apache/druid/pull/14065




[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1169346651


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -186,20 +186,28 @@ Treat `__time` as a millisecond timestamp: the number of milliseconds since Jan
 ### `dimensionsSpec`
 
 The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec` is:
+configuring [dimensions](./data-model.md#dimensions). 
 
-```
+You can either manually specify the dimensions or take advantage of Schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type. 
+
+To use Schema auto-discovery, set `useSchemaDiscovery` to `true`. 

Review Comment:
   I suppose we should continue to mention here that the classic string-based mode can be used by leaving the dimensions list empty, or by setting `includeAllDimensions` but not `useSchemaDiscovery`.
   
   `useSchemaDiscovery` = always use the new 'auto' column indexers for any column that is discovered during ingestion
   empty dimension list OR `includeAllDimensions` = classic string-based schema discovery
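
   Sketched as `dimensionsSpec` fragments (the comments and the `"host"` dimension are illustrative only; the flags are the ones named above):

   ```json
   // Type-aware schema discovery (new 'auto' column indexers):
   { "useSchemaDiscovery": true }

   // Classic string-based schema discovery (either form):
   { "dimensions": [] }
   { "dimensions": ["host"], "includeAllDimensions": true }
   ```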



##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +241,11 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+You can have Druid infer the schema for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. For any dimensions that aren't a uniform type, Druid ingests them as JSON.
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true` but do set `includeAllDimensions` to `true` with an empty dimensions list, Druid will ingest all columns as strings.

Review Comment:
   Druid will also do the older string-based schema discovery by just leaving the dimensions list empty. `includeAllDimensions` allows the string-based mode to _also_ work with partial schema declaration; otherwise, the only way to enable this mode is with an empty dimension list (and not setting the new `useSchemaDiscovery` to true).





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190217508


##########
docs/configuration/index.md:
##########
@@ -1520,7 +1520,7 @@ Additional peon configs include:
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
 |`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch-input-source.md) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use the string-based schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|

Review Comment:
   ```suggestion
   |`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use the string-based schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>You can overwrite this configuration  by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
   ```








[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190220650


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -186,20 +186,32 @@ Treat `__time` as a millisecond timestamp: the number of milliseconds since Jan
 ### `dimensionsSpec`
 
 The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec` is:
+configuring [dimensions](./data-model.md#dimensions). 
 
-```
+You can either manually specify the dimensions or take advantage of schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type. 
+
+To use schema auto-discovery, set `useSchemaDiscovery` to `true`. 
+
+Alternatively, you can use the string-based schemaless ingestion where any discovered dimensions are treated as strings. To do so, leave `useSchemaDiscovery` set to `false` (default). Then, set the dimensions list to empty or set the  `includeAllDimensions` property to `true`.
+
+The following `dimensionsSpec` example uses schema auto-discovery (`"useSchemaDiscovery": true`) in conjunction with explicitly defined dimensions to have Druid infer some of the schema for the data::

Review Comment:
   ```suggestion
   The following `dimensionsSpec` example uses schema auto-discovery (`"useSchemaDiscovery": true`) in conjunction with explicitly defined dimensions to have Druid infer some of the schema for the data:
   ```
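
   A sketch of such a mixed spec, per the surrounding description (dimension names here are hypothetical):

   ```json
   "dimensionsSpec": {
     "useSchemaDiscovery": true,
     "dimensions": [
       "userId",
       { "type": "long", "name": "retweetCount" }
     ]
   }
   ```

   Explicitly listed dimensions keep their declared types, while other discovered columns get the auto treatment.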





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190223858


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -223,7 +237,7 @@ Dimension objects can have the following components:
 
 | Field | Description | Default |
 |-------|-------------|---------|
-| type | Either `string`, `long`, `float`, `double`, or `json`. | `string` |
+| type | Either `auto`, `string`, `long`, `float`, `double`, or `json`. For the `auto` type, Druid determines the most appropriate type for the dimension and assigns one of the following: STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json> columns, all sharing a common 'nested' format. When you Druid infers the schema with schema auto-discovery, the type will be `auto`. | `string` |

Review Comment:
   ```suggestion
   | type | Either `auto`, `string`, `long`, `float`, `double`, or `json`. For the `auto` type, Druid determines the most appropriate type for the dimension and assigns one of the following: STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json> columns, all sharing a common 'nested' format. When Druid infers the schema with schema auto-discovery, the type is `auto`. | `string` |
   ```
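
   For example, a dimension object using the `auto` type (the column name is hypothetical):

   ```json
   { "type": "auto", "name": "userAgentDetails" }
   ```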





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190232974


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- For primitive types, Druid ingests them as the most appropriate native Druid type, either string, double or long. 
+- For input formats with native boolean types will be ingested as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
+- For arrays, the data gets ingested as an array typed column: `ARRAY<STRING>`, `ARRAY<DOUBLE>`, or `ARRAY<LONG>`. You can then interact with these columns using ARRAY functions as well as UNNEST.
+- For mixed types, complex nested structures, and arrays of complex nested structures, Druid ingests the column as the least restrictive data type.
+
+If you're already using string-based schema discovery and want to migrate, see [Migrating to type-aware schema discovery](#migrating-to-type-aware-schema-discovery).
+
+#### String-based schema discovery
+
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use the string-based schema discovery for ingestion if any of the following conditions are met: 
+
+- The dimension list is empty 
+- You set `includeAllDimensions` to `true` 
+
+Druid coerces primitives and arrays of primitive types into the native Druid string type. Nested data structures and arrays of nested data structures are ignored and not ingested.
+
+#### Migrating to type-aware schema discovery
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you previously used string-based schema discovery and want to migrate to type-aware schema discovery, you'll need to do the following:

Review Comment:
   ```suggestion
   If you previously used string-based schema discovery and want to migrate to type-aware schema discovery, do the following:
   ```
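
   At the spec level, the change itself is small — roughly (fragments only; this does not cover reindexing of existing data):

   ```json
   // Before: string-based discovery via an empty dimensions list
   "dimensionsSpec": { "dimensions": [] }

   // After: type-aware discovery
   "dimensionsSpec": { "useSchemaDiscovery": true, "dimensions": [] }
   ```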





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190229563


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- For primitive types, Druid ingests them as the most appropriate native Druid type, either string, double or long. 
+- For input formats with native boolean types will be ingested as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.

Review Comment:
   ```suggestion
   - For input formats with native boolean types, Druid ingests them as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
   ```





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190227896


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- For primitive types, Druid ingests them as the most appropriate native Druid type, either string, double or long. 

Review Comment:
   ```suggestion
   - **Primitive types**: Druid ingests them as the most appropriate native Druid type, either string, double or long. 
   ```





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190225019


##########
docs/ingestion/schema-design.md:
##########
@@ -120,8 +120,7 @@ you must be more explicit. Druid columns have types specific upfront.
 
 Tips for modeling log data in Druid:
 
-* If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
-[automatic detection of dimension columns](#schema-less-dimensions).
+* If you don't know ahead of time what columns you'll want to ingest, you can have Druid perform [schema auto-discovery](#schema-auto-discovery-for-dimensions).

Review Comment:
   ```suggestion
   * If you don't know ahead of time what columns to ingest, you can have Druid perform [schema auto-discovery](#schema-auto-discovery-for-dimensions).
   ```







[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190218152


##########
docs/configuration/index.md:
##########
@@ -1588,7 +1588,7 @@ then the value from the configuration below is used:
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
 |`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch-input-source.md) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). <br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|

Review Comment:
   ```suggestion
   |`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). <br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest placeholder data for empty columns or else not query on empty columns.<br/><br/>You can overwrite this configuration by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
   ```
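
   The task-context override mentioned in the suggested row can be sketched as a fragment of a task payload (a hypothetical example, not taken from the PR):

   ```json
   {
     "type": "index_parallel",
     "context": {
       "storeEmptyColumns": false
     }
   }
   ```

   Setting the key in `context` takes precedence over the `druid.indexer.task.storeEmptyColumns` runtime property for that task.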





[GitHub] [druid] 317brian commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "317brian (via GitHub)" <gi...@apache.org>.
317brian commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190328080


##########
docs/configuration/index.md:
##########
@@ -1520,7 +1520,7 @@ Additional peon configs include:
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
 |`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch-input-source.md) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use the string-based schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|

Review Comment:
   @ektravel Can you update your suggestion to replace "dummy," then I can commit it.





[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1192225392


##########
website/.spelling:
##########
@@ -307,6 +307,7 @@ firefox
 firehose
 firehoses
 fromPigAvroStorage
+frontcoded

Review Comment:
   is this needed?



##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,46 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is an experimental feature currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- **Primitive types**: Ingested as the most appropriate native Druid type, either string, double or long. 
+- **Native boolean types**:  Ingested as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
+- For input formats with native boolean types, Druid ingests them as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
+- For arrays, the data gets ingested as an array typed column: `ARRAY<STRING>`, `ARRAY<DOUBLE>`, or `ARRAY<LONG>`. You can then interact with these columns using ARRAY functions as well as UNNEST.
+- For mixed types, complex nested structures, and arrays of complex nested structures, Druid ingests the column as the least restrictive data type.

Review Comment:
   tried to clarify behavior a bit:
   
   ```suggestion
   When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in
   the exclusion list). Druid automatically chooses the most appropriate native Druid type among `STRING`, `LONG`,
   `DOUBLE`, `ARRAY<STRING>`, `ARRAY<LONG>`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` for nested data. For input formats with
   native boolean types, Druid ingests these values as strings if `druid.expressions.useStrictBooleans` is set to `false`
   (the default), or longs if set to `true` (for more SQL compatible behavior). Array typed columns can be queried using
   the [array functions](../querying/sql-array-functions.md) or [`UNNEST`](../querying/sql-functions.md#unnest). Nested
   columns can be queried with the [json functions](../querying/sql-json-functions.md).
   
   Mixed type columns are stored in the 'least' restrictive type that can represent all values in the column. For example,
   mixed numeric columns are `DOUBLE`, if there are any strings present then the column is a `STRING`, if there are arrays
   then the column becomes an array with the least restrictive element type, and finally any nested data or arrays of
   nested data will be stored as `COMPLEX<json>` nested columns.
   ```
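
   As a toy illustration of the least-restrictive rule described in the suggestion above, consider a hypothetical column `x` across a few input rows (sample data, not from the PR):

   ```json
   [
     {"x": 1},
     {"x": 2.5},
     {"x": "apple"}
   ]
   ```

   With type-aware discovery, `x` here would be inferred as a string column, since a string value is present; with only the first two rows it would be a double.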





[GitHub] [druid] 317brian commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "317brian (via GitHub)" <gi...@apache.org>.
317brian commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1169029778


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -200,6 +212,23 @@ configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec
 }
 ```
 
+<!--Schema auto-discovery-->
+
+```json
+    "tuningConfig": {
+      ...
+      "appendableIndexSpec": {
+        ...,
+        "useSchemaDiscovery": true
+      },
+    },

Review Comment:
   Is it still a child of `appendableIndexSpec`, ie:
   `dimensionsSpec.appendableIndexSpec.useSchemaDiscovery`?





[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1163963390


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -212,7 +241,7 @@ A `dimensionsSpec` can have the following components:
 | `dimensions`           | A list of [dimension names or objects](#dimension-objects). You cannot include the same column in both `dimensions` and `dimensionExclusions`.<br /><br />If `dimensions` and `spatialDimensions` are both null or empty arrays, Druid treats all columns other than timestamp or metrics that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) for details.<br /><br />As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, it would also be good to consider [`partitioning`](partitioning.md) by those same dimensions.                                                                                                                                                                                                                                  | `[]`    |
 | `dimensionExclusions`  | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br /><br />This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details.                                                                                                                                                                                                                                                                                                                                               | `[]`    |
 | `spatialDimensions`    | An array of [spatial dimensions](../development/geo.md).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | `[]`    |
-| `includeAllDimensions` | You can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions will appear first in order that you specify them and the dimensions dynamically discovered will come after. This flag can be useful especially with auto schema discovery using [`flattenSpec`](./data-formats.html#flattenspec). If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, all discovered dimensions will be ingested. | false   |

Review Comment:
   this unfortunately hasn't actually been removed yet. I do want to consolidate it into `useSchemaDiscovery`, but haven't yet



##########
docs/ingestion/ingestion-spec.md:
##########
@@ -200,6 +212,23 @@ configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec
 }
 ```
 
+<!--Schema auto-discovery-->
+
+```json
+    "tuningConfig": {
+      ...
+      "appendableIndexSpec": {
+        ...,
+        "useSchemaDiscovery": true
+      },
+    },

Review Comment:
   oops, the old PR descriptions are stale, `useSchemaDiscovery` is on `dimensionsSpec` these days
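
   For reference, the corrected placement described here would look roughly like the following `dimensionsSpec` fragment (a sketch, not the final doc text):

   ```json
   "dimensionsSpec": {
     "dimensions": [],
     "useSchemaDiscovery": true
   }
   ```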





[GitHub] [druid] vogievetsky commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "vogievetsky (via GitHub)" <gi...@apache.org>.
vogievetsky commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1186922589


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery (experimental)

Review Comment:
   I think it is better not to put `(experimental)` in the heading itself; later, when we remove it, we would have to remember to add a redirect. Better to just mention it in the warning text below





[GitHub] [druid] 317brian commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "317brian (via GitHub)" <gi...@apache.org>.
317brian commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1187695066


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery (experimental)

Review Comment:
   ```suggestion
   #### Type-aware schema discovery
   ```





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190219280


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -24,7 +24,7 @@ description: Reference for the configuration options in the ingestion spec.
   ~ under the License.
   -->
 
-All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md). All types of ingestion use an _ingestion spec_ to configure ingestion.
+All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md). Other than with SQL-based ingestion,  use an _ingestion spec_ to configure your ingestion.

Review Comment:
   ```suggestion
   All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md). Other than with SQL-based ingestion, use an _ingestion spec_ to configure your ingestion.
   ```








[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190235891


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- For primitive types, Druid ingests them as the most appropriate native Druid type, either string, double or long. 
+- For input formats with native boolean types will be ingested as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
+- For arrays, the data gets ingested as an array typed column: `ARRAY<STRING>`, `ARRAY<DOUBLE>`, or `ARRAY<LONG>`. You can then interact with these columns using ARRAY functions as well as UNNEST.
+- For mixed types, complex nested structures, and arrays of complex nested structures, Druid ingests the column as the least restrictive data type.
+
+If you're already using string-based schema discovery and want to migrate, see [Migrating to type-aware schema discovery](#migrating-to-type-aware-schema-discovery).
+
+#### String-based schema discovery
+
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use the string-based schema discovery for ingestion if any of the following conditions are met: 
+
+- The dimension list is empty 
+- You set `includeAllDimensions` to `true` 
+
+Druid coerces primitives and arrays of primitive types into the native Druid string type. Nested data structures and arrays of nested data structures are ignored and not ingested.
+
+#### Migrating to type-aware schema discovery
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you previously used string-based schema discovery and want to migrate to type-aware schema discovery, you'll need to do the following:
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+- Update any queries that use multi-value dimensions (MVDs) to use UNNEST in conjunction with other functions so that no  MVD behavior is being relied upon. Type-aware schema discovery generates ARRAY typed columns instead of MVDs, so queries that use any MVD features will fail.
+- Be aware of mixed typed inputs and test how type-aware schema discovery handles them. Druid attempts to cast them as the least restrictive type.
+- If you notice issues with numeric types, you may need to explicitly cast them. Generally though, Druid will handle the coercion for you.
+- Update your dimension exclusion list and add any nested columns if you want to continue to exclude them. String-based schema discovery automatically ignored nested columns, but type-aware schema discovery will ingest them.

Review Comment:
   ```suggestion
   - Update your dimension exclusion list and add any nested columns if you want to continue to exclude them. String-based schema discovery automatically ignores nested columns, but type-aware schema discovery will ingest them.
   ```
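
   The `includeAllDimensions` condition quoted in this hunk corresponds to a `dimensionsSpec` along these lines (a sketch with hypothetical dimension names):

   ```json
   "dimensionsSpec": {
     "dimensions": ["page", "userId"],
     "includeAllDimensions": true
   }
   ```

   Per the existing `includeAllDimensions` description, the explicit dimensions come first in the order given, and any discovered string-typed dimensions follow.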





[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1165001246


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -212,7 +241,7 @@ A `dimensionsSpec` can have the following components:
 | `dimensions`           | A list of [dimension names or objects](#dimension-objects). You cannot include the same column in both `dimensions` and `dimensionExclusions`.<br /><br />If `dimensions` and `spatialDimensions` are both null or empty arrays, Druid treats all columns other than timestamp or metrics that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) for details.<br /><br />As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, it would also be good to consider [`partitioning`](partitioning.md) by those same dimensions.                                                                                                                                                                                                                                  | `[]`    |
 | `dimensionExclusions`  | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br /><br />This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details.                                                                                                                                                                                                                                                                                                                                               | `[]`    |
 | `spatialDimensions`    | An array of [spatial dimensions](../development/geo.md).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | `[]`    |
-| `includeAllDimensions` | You can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions will appear first in order that you specify them and the dimensions dynamically discovered will come after. This flag can be useful especially with auto schema discovery using [`flattenSpec`](./data-formats.html#flattenspec). If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, all discovered dimensions will be ingested. | false   |

Review Comment:
   this inspired me to open #14076, which makes `useSchemaDiscovery` pick up the behavior of `includeAllDimensions`, so that we can reduce the number of flags required to combine a partial schema definition with schema discovery using the new stuff



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190231368


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..
+
+You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
+
+When performing type-aware schema discovery, Druid can discover all of the columns of your input data (that aren't in the exclusion list), but the behavior depends on the source data:
+
+- For primitive types, Druid ingests them as the most appropriate native Druid type, either string, double or long. 

Review Comment:
   Consider formatting this list differently. For example:
   
   ```suggestion
   - **Primitive types**: Ingested as the most appropriate native Druid type, either string, double or long. 
   - **Native boolean types**:  Ingested as strings or longs depending on the value of `druid.expressions.useStrictBooleans`.
   ```





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190225886


##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +240,45 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery-experimental) where Druid infers the schema and type for your data. Type-aware schema discovery is experimental currently available for native batch and streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all the discovered columns are typed as either native string or multi-value string columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns..

Review Comment:
   ```suggestion
   > Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.
   ```





[GitHub] [druid] ektravel commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "ektravel (via GitHub)" <gi...@apache.org>.
ektravel commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1190218152


##########
docs/configuration/index.md:
##########
@@ -1588,7 +1588,7 @@ then the value from the configuration below is used:
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
 |`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch-input-source.md) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use schemaless ingestion and don't specify any dimensions to ingest, you must also set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). <br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>This configuration can be overwritten by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|

Review Comment:
   ```suggestion
   |`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to store empty columns during ingestion. When set to true, Druid stores every column specified in the [`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). <br/><br/>If you set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns will fail. If you intend to leave `storeEmptyColumns` disabled, you should either ingest dummy data for empty columns or else not query on empty columns.<br/><br/>You can overwrite this configuration by setting `storeEmptyColumns` in the [task context](../ingestion/tasks.md#context-parameters).|true|
   ```
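
   As a rough illustration of the per-task override mentioned at the end of the suggested text (the surrounding structure is an assumption; only the `storeEmptyColumns` key comes from the docs being reviewed), a task context entry might look like:

   ```json
   {
     "context": {
       "storeEmptyColumns": false
     }
   }
   ```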





[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1169089214


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -200,6 +212,23 @@ configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec
 }
 ```
 
+<!--Schema auto-discovery-->
+
+```json
+    "tuningConfig": {
+      ...
+      "appendableIndexSpec": {
+        ...,
+        "useSchemaDiscovery": true
+      },
+    },

Review Comment:
   no, it goes directly on `dimensionsSpec`; `appendableIndexSpec` is part of `tuningConfig`
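
   A minimal sketch of the placement described above, with `useSchemaDiscovery` set directly on `dimensionsSpec` rather than inside `tuningConfig.appendableIndexSpec` (the empty `dimensions` list and surrounding fields are illustrative assumptions, not part of the reviewed docs):

   ```json
   "dimensionsSpec": {
     "useSchemaDiscovery": true,
     "dimensions": []
   }
   ```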


