Posted to commits@druid.apache.org by "clintropolis (via GitHub)" <gi...@apache.org> on 2023/05/02 00:58:26 UTC

[GitHub] [druid] clintropolis commented on a diff in pull request #14065: docs: add docs for schema auto-discovery

clintropolis commented on code in PR #14065:
URL: https://github.com/apache/druid/pull/14065#discussion_r1181974883


##########
docs/ingestion/schema-design.md:
##########
@@ -120,8 +120,8 @@ you must be more explicit. Druid columns have types specified upfront.
 
 Tips for modeling log data in Druid:
 
-* If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
-[automatic detection of dimension columns](#schema-less-dimensions).
+* If you don't know ahead of time what columns you'll want to ingest, set `dimensionsSpec.useSchemaDiscovery` to `true` and provide an empty or partially defined dimensions list to enable
+[schema auto-discovery](#schema-auto-discovery-for-dimensions).

Review Comment:
   Since there are two modes, I think this should just link to the schema discovery section without digging into how to enable it.



##########
docs/ingestion/schema-design.md:
##########
@@ -241,12 +241,14 @@ You should query for the number of ingested rows with:
 ]
 ```
 
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+You can have Druid infer the schema for your data, partially or fully, by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or none of the dimensions in the dimensions list. Druid ingests any dimension whose values are not a uniform type as JSON.
 
-If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can still use string-based schemaless ingestion if either of the following conditions is met:
 
-Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
+- The dimensions list is empty
+- You set `includeAllDimensions` to `true`
 

Review Comment:
   While we have these two modes, I think it is worth describing them more clearly, particularly the differences with regard to multi-value dimensions and array types, and calling out that there currently isn't an effortless migration path between the two modes.
   
   We should probably give them semi-official names; I recommend "string-based schema discovery" and "type-aware schema discovery".
   
   For string-based schema discovery, the important points are that all columns it discovers will be either native single-value or multi-value `STRING`. It only discovers primitive types and arrays of primitive types, and in both cases coerces them to native Druid `STRING` types. Anything that is a nested data structure, or an array of nested data structures, such as a JSON object, is ignored and not ingested.
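   
   For illustration, a minimal sketch of a `dimensionsSpec` that triggers string-based discovery (trimmed to the relevant fields; the rest of the ingestion spec is elided). An empty `dimensions` list is enough, or you can list some dimensions explicitly and set `includeAllDimensions` to `true`:
   
   ```json
   {
     "dimensionsSpec": {
       "dimensions": []
     }
   }
   ```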
   
   Type-aware schema discovery discovers primitive types and preserves them as the most appropriate native Druid type: `STRING`, `DOUBLE`, or `LONG`. Input formats with native boolean types are ingested as `STRING` or `LONG` depending on the value of `druid.expressions.useStrictBooleans`. Some formats have specific coercion rules for things like enums and other special types, which are typically coerced to `STRING`, but this is determined by the input format, not schema discovery, and is consistent with how `flattenSpec` handles these types (I'm not sure this necessarily needs to be called out, since how these values are coerced depends on the underlying format).
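   
   And a minimal sketch of enabling the type-aware mode (again trimmed to the `dimensionsSpec`; note `druid.expressions.useStrictBooleans` is a runtime property, so it is set in the service configuration rather than in the spec):
   
   ```json
   {
     "dimensionsSpec": {
       "useSchemaDiscovery": true,
       "dimensions": []
     }
   }
   ```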
   
   For type-aware mode, arrays of primitives in the input data are ingested as native `ARRAY` typed columns: `ARRAY<STRING>`, `ARRAY<DOUBLE>`, or `ARRAY<LONG>`. These have `ARRAY` semantics in SQL and native queries, so they are typically queried with the array functions (`ARRAY_` prefix) and/or `UNNEST`. Multi-value strings, in contrast, can be queried with either `VARCHAR` or `ARRAY` semantics (the latter via the `MV_` functions) and have an implicit unnest, but only when grouping.
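   
   To make the difference concrete, a sketch against a hypothetical `logs` datasource with a `tags` column (the names are illustrative, not from the docs):
   
   ```sql
   -- tags ingested as ARRAY<STRING> (type-aware discovery): filter with
   -- ARRAY_ functions; grouping on elements needs an explicit UNNEST
   SELECT t.tag, COUNT(*) AS cnt
   FROM logs CROSS JOIN UNNEST(tags) AS t(tag)
   WHERE ARRAY_CONTAINS(tags, 'error')
   GROUP BY t.tag
   
   -- tags ingested as a multi-value STRING (string-based discovery):
   -- MV_ functions give ARRAY semantics; grouping unnests implicitly
   SELECT tags, COUNT(*) AS cnt
   FROM logs
   WHERE MV_CONTAINS(tags, 'error')
   GROUP BY tags
   ```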
   
   Type-aware schema discovery will also ingest mixed-type input, complex nested structures, and arrays of complex nested structures as `COMPLEX<json>`, as a sort of catch-all. This means that, unlike string-based schema discovery, type-aware schema discovery finds _all_ of the columns in the input data.
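   
   For example, given input rows like these (made-up data), type-aware discovery would ingest `code` as `LONG` and `session` as `COMPLEX<json>`, while string-based discovery would ingest `code` as a `STRING` and skip `session` entirely:
   
   ```json
   {"time": "2023-05-01T00:00:00Z", "code": 200, "session": {"id": "abc", "steps": [1, 2]}}
   {"time": "2023-05-01T00:00:01Z", "code": 404, "session": {"id": "def", "steps": [3]}}
   ```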
   
   The migration path we are planning between these two modes is not effortless. The biggest difference is the new mode using ARRAY types instead of MVDs, but also any mixed-type inputs might now be ingested as `COMPLEX<json>` instead of being coerced to strings, which will prevent grouping (for now; this is going to change in the future). Numeric types shouldn't cause too much trouble, I think, since Druid makes a best effort to coerce values as needed to complete the query successfully, but the SQL schema will change to start validating these values as numeric, so queries might need the occasional explicit cast.
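   
   A sketch of that last point, assuming a hypothetical `status` column that string-based discovery had ingested as `STRING` but type-aware discovery now ingests as `LONG`:
   
   ```sql
   -- numeric comparison works against the LONG column directly
   SELECT COUNT(*) FROM logs WHERE status = 500
   
   -- string-typed operations now need an explicit cast
   SELECT COUNT(*) FROM logs WHERE CAST(status AS VARCHAR) LIKE '5%'
   ```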
   
   The current plan for migrating MVDs is that queries should first be migrated to use explicit `UNNEST`, to ensure that no implicit multi-value behavior is being relied on. We plan to introduce a config flag (or flags) requiring that MVDs be handled explicitly as ARRAY types, to help verify that all queries have been migrated, but we haven't added such a flag for this release. Once the queries are migrated, the new type-aware schema discovery mode can be enabled, after ensuring that any dimension exclusion lists are updated to include nested columns that previously would not have been automatically discovered, should users wish to continue excluding them. Alternatively, the new type-aware mode can be enabled with explicitly defined `string` dimensions for known MVDs or mixed-type inputs, which then continue to be ingested as MVDs until all of the queries can be migrated to explicit `UNNEST` and/or `ARRAY_` functions (see the sketch below).
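   
   A sketch of that alternative (`tags` stands in for a known MVD): enable type-aware discovery for everything else, but pin the known multi-value dimensions as explicit `string` dimensions so they keep ingesting as MVDs:
   
   ```json
   {
     "dimensionsSpec": {
       "useSchemaDiscovery": true,
       "dimensions": [
         { "type": "string", "name": "tags" }
       ]
     }
   }
   ```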
   
   It's also worth calling out that the new type-aware schema discovery mode is experimental. I imagine we will likely remove the string-based schema discovery mode once the type-aware mode has matured, but until then we should clearly distinguish them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

