Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/07/14 16:34:06 UTC

[GitHub] [druid] techdocsmith commented on a change in pull request #11428: Consolidate multi-value dimension doc and highlight configurability

techdocsmith commented on a change in pull request #11428:
URL: https://github.com/apache/druid/pull/11428#discussion_r669774384



##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -23,27 +23,52 @@ title: "Multi-value dimensions"
   -->
 
 
-Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an
-array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
-characters). By default Druid ingests the values in alphabetical order, see [Dimension Objects](../ingestion/index.md#dimension-objects) for configuration.
+Apache Druid supports "multi-value" string dimensions, which result from input fields that contain an
+array of values instead of a single value. 

Review comment:
       It might be good to show an example here of input data and how Druid ingests it (as setup for the `tags` example later?).
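
       A rough sketch of what such an input -> Druid example could convey, assuming the three `multiValueHandling` modes behave as the proposed Overview text describes (`SORTED_ARRAY` sorts, `SORTED_SET` sorts and removes duplicates, `ARRAY` keeps input order). The function name `apply_multi_value_handling` and the sample row are hypothetical and only model the documented behavior; this is not Druid code.

```python
def apply_multi_value_handling(values, mode="SORTED_ARRAY"):
    """Model how one multi-value input field would be stored under each mode.

    Illustrative only: the mode names come from the doc under review,
    the implementation is an assumption that mirrors their descriptions.
    """
    if mode == "SORTED_ARRAY":
        # Default: values are sorted, duplicates are kept.
        return sorted(values)
    if mode == "SORTED_SET":
        # Values are sorted and duplicates are removed.
        return sorted(set(values))
    if mode == "ARRAY":
        # Original order and duplicates are retained.
        return list(values)
    raise ValueError(f"unknown multiValueHandling mode: {mode}")


# Hypothetical input row with an unsorted, duplicated `tags` array.
row = {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t3", "t1", "t3", "t2"]}

for mode in ("SORTED_ARRAY", "SORTED_SET", "ARRAY"):
    print(mode, apply_multi_value_handling(row["tags"], mode))
```

       Showing one unsorted row with a duplicate makes the difference between the three modes visible at a glance, which the already-sorted `["t1","t2","t3"]` rows cannot do.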

##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -23,27 +23,52 @@ title: "Multi-value dimensions"
   -->
 
 
-Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an
-array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
-characters). By default Druid ingests the values in alphabetical order, see [Dimension Objects](../ingestion/index.md#dimension-objects) for configuration.
+Apache Druid supports "multi-value" string dimensions, which result from input fields that contain an
+array of values instead of a single value. 
 
-This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
-are used as a dimension being grouped by. See the section on multi-value columns in
-[segments](../design/segments.md#multi-value-columns) for internal representation details. Examples in this document
+This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see
+[segments documentation](../design/segments.md#multi-value-columns). Examples in this document
 are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql.md) for details
 about using multi-value string dimensions in SQL.
 
+## Overview
+
+At ingestion time, Druid can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. It detects JSON arrays or CSV/TSV fields as multi-value dimensions.
+
+For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `parseSpec`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `parseSpec` configuration.
+
+The following shows an example multi-value dimension named `tags` in a `dimensionsSpec`:
+
+```
+"dimensions": [
+  {
+    "type": "string",
+    "name": "tags",
+    "multiValueHandling": "SORTED_ARRAY",
+    "createBitmapIndex": true
+  }
+],
+```
+
+By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as:
+
+* `SORTED_SET`: results in the removal of duplicate values
+* `ARRAY`: retains the original order of the values
+
+See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling.
+
+
 ## Querying multi-value dimensions
 
-Suppose, you have a dataSource with a segment that contains the following rows, with a multi-value dimension
-called `tags`.
+The following sections describe filtering and grouping behavior based on the following example data, which includes a multi-value dimension, `tags`.
 
 ```
 {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1

Review comment:
       TODO for later: prefer a "real-world" example over dummy tags.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org